J Chron Dis Vol. 32, pp. 427 to 440. Pergamon Press Ltd. 1979. Printed in Great Britain

PREDICTABILITY OF CORONARY HEART DISEASE

TAVIA GORDON,* WILLIAM B. KANNEL† and MAX HALPERIN*‡

(Received in revised form 17 November 1978)

Abstract-In the past 30 yr there has been an impressive increase in our ability to predict coronary heart disease (CHD). Some of the developments in epidemiology and statistics from which this derives are discussed. While it is not possible to specify a useful upper bound to the predictability of coronary heart disease, the paper provides guides to determining the effectiveness of various CHD risk functions, with illustrative examples from the Framingham Study, and discusses interpretative problems.

DURING the last 30 yr a number of risk factors for coronary heart disease (CHD) have been confirmed, their relationship to CHD incidence in U.S. populations has been quantified, and efficient procedures for using the composite predictive information from these factors have been developed and tested. Of the core triad of CHD risk factors (blood pressure, serum cholesterol and cigarette smoking), only blood pressure was firmly established 30 yr ago. Given the strides that cardiovascular epidemiology has made since, it is appropriate to ask at this time how well we can actually predict the appearance of coronary heart disease on the basis of currently known risk factors.

DEFINING THE PROBLEM

Prediction can be defined in statistical terms, inter alia, as a problem in classification, hazard, or dose-response. While the latter two seem more reasonable than the first, as will appear from our discussion, the interconnections among all of them are so intimate that their separation must be attempted with some delicacy.

Classification

According to the classification model, a person belongs either to one population (CHD cases) or another (non-CHD cases). One way of approaching this is in terms of discrimination [1]. Each person has a set of specified values for a defined group of risk characteristics. Since the case and noncase populations are conceived as being mixed together at the beginning of followup, the problem is to combine the information from these risk characteristics to identify the population to which each individual will belong at the end of a fixed interval of followup. Perfect discrimination occurs when each and every person is correctly identified as a case or noncase. Clearly this is unrealistic. Once some risk is assumed, even a very low risk, some cases may develop, and if the low-risk class is large enough some cases should be expected. Nor can there be absolute certainty that any person will develop CHD. The efficacy of the estimation should be subject not only to evaluation in terms of the fixed interval of followup but also to the test of experience in the next time intervals. Thus, if groups designated at low risk develop CHD at a low rate in the initial interval and also in subsequent intervals this constitutes a further validation of the estimation of risk status; and similarly for groups designated to be at high risk.

*Formerly with the National Heart, Lung, and Blood Institute.
†Framingham Heart Disease Epidemiology Study, National Heart, Lung, and Blood Institute.
‡Department of Statistics, Biostatistics Center, George Washington University. Work partially supported by NIH grants HL15191-07 and CA1568GO4.

Hazard

A simplified formulation of hazard is the following: Every person of a given age who is free of CHD will develop CHD sooner or later unless he dies of some other cause first. We are, then, concerned with the distribution of time-to-CHD, a continuous variable whose distribution is conceived to be conditional on certain characteristics or risk factors. This distribution is modified by the competing risk of death from non-CHD causes, which may intervene before CHD occurs. In considering the distribution of time-to-CHD, it is reasonable to define the predictive power of risk factors in terms of the variance of this distribution. A set of risk factors which accounts for a significantly larger proportion of the unconditional variance of time-to-CHD than another set is better.

Such a formulation has certain difficulties. In principle, a hazard function encompasses the experience of a full life span, but observations on which to base statistical estimates of such a function over such a range are not presently adequate. Furthermore, neither the distribution of time-to-CHD nor that of time-to-non-CHD-death can be construed as static in the real world. Nonetheless, this formulation is conceptually attractive.

Dose-response

A related approach is to consider the probability of developing CHD in a more limited fixed time span, conditional on a specified set of risk factors. This may be viewed as a dose-response relationship. Persons with low (or occasionally with high) doses of the risk factors are less likely to develop CHD in the specified time period; persons with high (or occasionally with low) doses are more likely to develop CHD. In this formulation, as in the hazard formulation, nobody is certain to either avoid CHD or develop CHD. Rather, the risk is considered to increase with the level of the specified risk characteristics. This is only an approximation of the real world, but statistical functions based on this model (and particularly the logistic function) have been repeatedly shown to fit a wide variety of data sets reasonably well.

Currently, the most commonly used function for predicting CHD in a fixed time interval, given certain characteristics or risk factors, is the logistic regression. In a regression model the main test of relevance is the goodness of fit. If the regression of CHD incidence in a fixed time interval on a set of risk characteristics is computed, it is possible to rank the estimated probabilities of CHD from lowest to highest, and to divide the population into (say) deciles of risk. The agreement of the actual and expected number of cases in each decile can then be examined to assess the accuracy of the prediction.

However, fit is not the only criterion for evaluating a risk function. It is also desirable that the risk slope be steep; that is, that only a small per cent of the cases fall in the lowest decile of risk (the smaller the better) and that a large per cent of cases fall in the highest decile of risk (the larger the better). This can be viewed as a classification test. Moreover, in principle, low risk can approach zero and high risk can approach unity.
If a risk function can identify persons at very low and very high risk this would generally be considered desirable. On the other hand, if a single characteristic were ever discovered which either caused or prevented CHD with near certainty this characteristic should probably be evaluated separately, rather than included with other factors in a multivariate function. Whether the possibility of near-perfect prediction warrants serious consideration is moot, but in any event a focus on the very high and very low risk groups is entirely too narrow in terms of present knowledge. It does not take into account the groups of intermediate risk that ordinarily account for a large proportion of the CHD incidence. When these groups are considered, it becomes obvious that what is sought is not perfect classification but a function that gives as steep a risk gradient as possible.

Before attempting to explore some of these statistical issues in terms of specific data it is well to consider the impact of a number of sources of imprecision in the basic data.
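The ranking procedure described under the dose-response formulation can be sketched in a few lines. This is only an illustration: the intercept, coefficients, risk-factor values and outcomes below are hypothetical, not Framingham estimates, and the function names are introduced here for convenience.

```python
import math

def logistic_risk(intercept, coefs, x):
    """Probability from a logistic function: 1 / (1 + exp(-(a + b.x)))."""
    z = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))

def risk_groups(risks, outcomes, n_groups=10):
    """Rank persons by estimated risk and split them into equal-sized groups.

    Returns one (expected, actual) pair per group, lowest risk first:
    expected is the sum of estimated risks, actual the count of cases.
    Assumes len(risks) is divisible by n_groups.
    """
    order = sorted(range(len(risks)), key=lambda i: risks[i])
    size = len(order) // n_groups
    table = []
    for g in range(n_groups):
        members = order[g * size:(g + 1) * size]
        expected = sum(risks[i] for i in members)
        actual = sum(outcomes[i] for i in members)
        table.append((expected, actual))
    return table

# Hypothetical coefficients and data, for illustration only.
INTERCEPT, COEFS = -9.0, (0.02, 0.01)   # systolic BP (mmHg), cholesterol (mg/dl)
people = [(110 + 4 * k, 180 + 5 * k) for k in range(20)]
risks = [logistic_risk(INTERCEPT, COEFS, p) for p in people]
outcomes = [0] * 17 + [1, 0, 1]         # made-up 0/1 case indicators
groups = risk_groups(risks, outcomes, n_groups=5)
```

Each pair in `groups` gives the expected and actual number of cases in one risk group, which is the comparison used both for goodness of fit and for judging the steepness of the risk gradient.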

SOURCES OF DATA IMPRECISION

Case-finding

The dependent variable under discussion, coronary heart disease, is the end product of a complicated, imperfectly understood process. A major component of the process is the progressive narrowing of the coronary arteries due to atherosclerosis. Autopsy studies indicate that some degree of coronary atherosclerosis is common in middle-aged American men and that this may account for the high incidence of CHD [2-4]. However, at times the narrowing may be quite severe without any clinical manifestation. Alternatively, a clinical manifestation may sometimes occur with only minimal narrowing. Moreover, it is presumed that occasionally a thrombus arising from other processes may block a coronary artery, leading to myocardial ischemia with or without infarction. Arrhythmias may arise from these or other sources, leading to sudden death. Under current rules of attribution such deaths are ordinarily ascribed to coronary heart disease, although this is by necessity inferential.

Nor is the diagnosis of clinical CHD certain. It is not uncommon to find evidence of an old myocardial infarction on autopsy that had never surfaced clinically [5-8]. Moreover, routine electrocardiograms will sometimes reveal a recent myocardial infarction in persons who have never evinced suggestive cardiac symptoms and who would not have been discovered except for the routine electrocardiogram [8]. At the same time, the pathognomonic electrocardiographic evidence of myocardial infarction manifest acutely may revert to a nonspecific form within a relatively short time after the event [8, 9, 10].

If we add to these sources of imprecision the inevitable variation in electrocardiographic interpretation, in laboratory tests for enzymes used in diagnosing myocardial infarction, and in the presentation and interpretation of a clinical history, it is fairly clear that a true clinical event may sometimes go undiagnosed and events not due to coronary heart disease may be incorrectly attributed to this cause. While these sources of variation have not been precisely quantified, enough is known of their magnitude to indicate a non-trivial amount of misdiagnosis. Thus, even if we could correctly predict every true case of CHD, some of the cases would be missed and some of the noncases would be called cases, resulting in an apparently imperfect prediction.
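The last point can be made concrete with a small worked example. The figures are entirely hypothetical (the text notes that diagnostic error has not been precisely quantified): a group of 1000 men with 20 true cases, a predictor that flags exactly the true cases, and assumed diagnostic sensitivity and specificity of 0.90 and 0.99.

```python
def apparent_performance(n, true_cases, sensitivity, specificity):
    """Diagnosed outcome counts when a hypothetical predictor flags every true case."""
    true_noncases = n - true_cases
    detected = sensitivity * true_cases                 # true cases that are diagnosed
    missed = true_cases - detected                      # true cases never diagnosed
    false_cases = (1.0 - specificity) * true_noncases   # noncases labelled as CHD
    diagnosed = detected + false_cases
    return {
        "diagnosed_cases": diagnosed,
        "flagged_but_not_diagnosed": missed,
        "diagnosed_but_not_flagged": false_cases,
        "share_of_diagnosed_cases_flagged": detected / diagnosed,
    }

# Hypothetical figures: 1000 men, 20 true cases, imperfect diagnosis.
result = apparent_performance(1000, 20, sensitivity=0.90, specificity=0.99)
```

Under these assumed figures a flawless predictor would appear to have anticipated only about two thirds of the diagnosed cases, purely because of error in the outcome measure.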

Risk-factor assessment

The independent variables, the so-called risk factors, have similar difficulties. None of them can be measured with certainty. All of them constitute indices of processes which are highly complex. For example, blood pressure is notoriously variable and responsive to a variety of internal and external stimuli. Total cholesterol is now known to include one component (in the low density lipoproteins) positively associated with CHD incidence and another (in the high density lipoproteins) negatively associated with CHD incidence. The metabolic processes determining the levels of each are poorly understood but apparently quite complicated [11, 12]. It is likely that these processes are in fact deeply implicated in atherogenesis. The measurement of total cholesterol indexes this only in a highly abbreviated manner. It would therefore be naive to anticipate that the prediction of CHD is likely to approach perfection as the disease itself and the usual risk factors are currently defined and measured.

REGRESSION ESTIMATES

The conditional probability of CHD in 2 yr given a set of risk characteristics is represented in this report by a logistic regression function in which the parameters are estimated by the method of Walker and Duncan [13]. Strictly speaking, this is data summarization, not prediction. Its relation to actual prediction will be discussed briefly later. The illustrations used come from the experience accumulated during the first 11 biennial examinations of the Framingham Study cohort [14] and are restricted to men 45-54 yr old at the beginning of each observation period.

The risk factors considered are serum cholesterol, systolic blood pressure, cigarette smoking (yes, no), glucose intolerance (yes, no), heart enlargement on X-ray (definite or not), ECG-LVH (definite or not) and urine albumin (definite or not). Age and sex, which are also risk factors, are held substantially constant by restricting consideration to men 45-54. This roster of characteristics has only one thing in common: in persons free of CHD they are all associated with the risk that CHD will develop in the next 2 yr. Serum cholesterol, blood pressure and cigarette smoking are alterable and are not manifestations of CHD. Glucose intolerance can be treated, but there is little reason to believe that the present therapies lower the associated CHD risk. Heart enlargement on X-ray or ECG-LVH is, in part, a manifestation of elevated blood pressure and in part a consequence of ischemic damage to the myocardium. Urine albumin may indicate hypertensive cardiovascular problems but it is not a specific manifestation of CHD. Age is not reversible; sex is ordinarily a fixed characteristic.

Only serum cholesterol, systolic blood pressure and age are entered in these calculations as continuous variables. The others are treated as discrete variables. Criteria and measurement techniques are described in previous reports [15].

Steepness of slope

There are several ad hoc statistics which may be used to measure the steepness of the risk gradient obtained by a logistic regression on a specified set of risk factors. For example, Menotti et al. [16] have proposed

$$M_r = \sum_{i=1}^{10} c_i (i - a)^r / N_c$$

as statistics, where $c_i$ is the number of cases observed in the $i$th decile, $a$ is a constant (perhaps zero) and $N_c$ is the total number of cases. Somewhat simpler is the relative risk statistic obtained by dividing the number of cases in the upper decile (or quintile) by the number of cases in the lowest decile (or quintile): D10/D1 or Q5/Q1, in the usual shorthand. The more this number exceeds 1.0 the steeper the gradient. If D1 or Q1 is zero this statistic is undefined, but that may be a warning that the total number of cases is too small for firm conclusions. A related statistic is the difference between the number of cases in the upper decile (or quintile) and in the lowest decile (or quintile), perhaps divided by the total number of cases. A value of zero for this ratio implies that the risk factors were of no assistance in predicting CHD.

Various combinations of the risk factors are considered in Table 1 and evaluated in terms of the criteria previously specified. Suppose we begin with systolic blood pressure and serum cholesterol. The sample ratio of Q5/Q1 is 2.60 for systolic blood pressure and 3.38 for serum cholesterol (Table 1). If we combine these two risk factors into a bivariate function the ratio of Q5/Q1 is 6.50. After adding cigarette smoking the ratio becomes 6.12 (clearly the sampling variability is fairly large if adding a risk factor decreases the sample ratio). Adding all the other variables, the ratio becomes 8.67. There is a special interest in what goes on in the upper and lower ends of the risk scale.
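As a minimal sketch, the steepness statistics just described can be computed directly from ten decile counts. The counts used are the actual cases for the seven-variable function as read from Table 1; the function name is introduced here, and the Menotti-type moment follows the formula as read from the text, with `a` and `r` left as free parameters.

```python
def steepness(cases, a=0.0, r=1):
    """Summary statistics for a list of 10 decile case counts (lowest risk first)."""
    if len(cases) != 10:
        raise ValueError("expected one count per decile")
    total = sum(cases)
    d1, d10 = cases[0], cases[9]
    q1, q5 = cases[0] + cases[1], cases[8] + cases[9]
    return {
        "D10/D1": d10 / d1 if d1 else None,   # undefined when the lowest decile is empty
        "Q5/Q1": q5 / q1 if q1 else None,     # undefined when the lowest quintile is empty
        "(D10-D1)/N": (d10 - d1) / total,
        # Menotti-type moment, as read from the text: sum c_i (i - a)^r / N_c
        "M_r": sum(c * (i - a) ** r for i, c in enumerate(cases, start=1)) / total,
    }

# Actual cases by decile for the seven-variable function (Table 1).
seven_variable = [2, 4, 6, 10, 14, 14, 21, 15, 16, 36]
stats = steepness(seven_variable)
```

For these counts D10/D1 is 18.0 and Q5/Q1 is 52/6, or about 8.67, in agreement with the table. A flat distribution of cases across deciles gives both ratios equal to 1.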
Considering only the estimates of the number of cases in the upper decile of risk, it is clear in this instance that once two significant risk factors are considered, the incremental gain in the number of cases appearing in the high risk group is quite small as additional risk factors are added. What does happen, however, is that more cases come to be shifted from the lower end of the predicted risk scale to the upper. This expresses itself in an improvement in correct identification of both low and high risk persons and a noticeable increase in the ratio Q5/Q1. However, after the first two variables (and almost any two variables of roughly equal strength would serve to illustrate the same point), adding additional variables increases the ratio rather slowly. This phenomenon is also frequently observed in multivariate linear regression.

GOODNESS OF FIT

If the regression of CHD incidence in a fixed time period conditional on some risk factor or factors is estimated, several criteria may be considered for determining how well the estimated regression fits the sample data. As Efron has observed, there is no formally best criterion when the response variable is dichotomous, but one good criterion is the likelihood ratio statistic [17]. This may be used to assess whether the data are fitted significantly better by taking the risk factors into account than by assuming everyone to have the same risk. Another is to order the population by the magnitude of the estimated risk, dividing it into (say) 10 groups along the risk gradient and then comparing the actual and expected number of cases in each decile of risk by means of the usual $\chi^2$ goodness-of-fit statistic. It is difficult to define the exact sampling characteristics of this statistic in this case beyond saying that a smaller number is better than a larger number.

A third method for assessing goodness of fit is to calculate the average proportion of the variance of the probability of developing CHD accounted for by regression. This is a more complicated concept than may appear at first blush. 'Proportion of variance explained' is a statistical property of regression estimates that is particularly useful when dealing with data well described by a multivariate normal distribution. In the multivariate normal case, the usefulness of the conditional expectation of $y$ given $x_1, \ldots, x_s$ can be evaluated by $R^2$, the squared multiple correlation coefficient. It is conceivable that $R^2$, which is sometimes described as the proportion of variance of $y$ explained by $x_1, \ldots, x_s$, can attain 100%. In fact, instances where $R^2$ approaches 100% are actually encountered in practice. Even in the multivariate normal case the phrase 'proportion of variance explained' sometimes is taken to mean more than it says.
The proportion of variance explained, whether it is large or small, does not cast any light on whether the variables considered are explanatory in a mechanistic sense or whether there are other important factors involved. Even where $R^2$ is non-trivial it is entirely conceivable that a completely different set of factors could be found that had an equal 'explanatory' power to those under consideration.

When the response variable is dichotomous, as in the case of CHD, additional interpretative problems arise. In this case $y$ takes on the values of zero or one (occurrence or non-occurrence of CHD in a specified period of followup). Furthermore, the conditional expectation of $y$ given $x_1, \ldots, x_s$, say $p(x)$, is a non-linear function of $x_1, \ldots, x_s$. In such a context it is theoretically conceivable that the average proportion of variance of $y$ explained by $x_1, \ldots, x_s$ can reach 100%, but this can occur only in certain degenerate situations. Two such situations are cited in the appendix, the more interesting being the case where for some specified constellation of values of the risk characteristics a person is certain to develop CHD, i.e. $p(x_0)$ is one, and for all other constellations a person is certain not to develop CHD.

We believe that such cases are of no interest in the real world. Rather we believe that the distributions of $p$ (the CHD risk) ordinarily encountered are likely to be similar to the distribution of conditional probabilities from the seven variable function described in Table 1: a right-skewed unimodal continuous distribution with a very long tail. Unfortunately this assumption is not subject to direct verification, since only estimated conditional probabilities can ever be known and altering the set of independent variables tends to alter the distribution of estimated conditional probabilities. The assumption of right-skewness implies that the mean of this distribution ($\bar{p}$) will be to the right of the mode ($p_m$), i.e. $p_m < \bar{p}$. Since the distribution of $p$ is assumed to be continuous, the mode is defined as the value of $p$ with the largest probability density. This expectation can be and should be checked against the data in any particular case if the following arguments are to be used.

TABLE 1. DISTRIBUTION OF CHD BY LEVEL OF RISK AS ESTIMATED BY SPECIFIED RISK PROFILES: FRAMINGHAM STUDY, MEN 45-54 (138 EVENTS OCCURRING PRIOR TO EXAM 11)

Expected number of cases

Decile of risk*      (1)      (2)   (1)+(2)   (1)+(2)+(3)   (1)+(2)+(3)+(4)   (1)-(7)
1                    8.9      7.1       6.2           5.6               5.4       5.3
2                   10.0      8.9       7.9           7.2               7.0       6.9
3                   10.8     10.0       9.0           8.4               8.2       8.0
4                   11.4     10.9      10.1           9.6               9.4       9.2
5                   12.4     11.9      11.3          10.9              10.7      10.5
6                   13.1     13.0      12.6          12.4              12.2      11.9
7                   14.2     14.4      14.2          14.2              14.0      13.6
8                   15.3     16.0      16.3          16.6              16.5      16.1
9                   17.6     18.6      19.8          20.5              20.4      20.1
10                  24.5     28.0      31.4          33.3              34.2      36.6
D10/D1              2.75     3.91      5.05          6.01              6.54      6.87
Q5/Q1               2.22     2.91      3.63          4.22              4.40      4.65
D10 - D1            15.6     20.9      25.2          27.7              28.8      31.3

Actual number of cases

Decile of risk*      (1)      (2)   (1)+(2)   (1)+(2)+(3)   (1)+(2)+(3)+(4)   (1)-(7)
1                      9        5         3             4                 3         2
2                      6        8         5             4                 4         4
3                     10        7         8             9                 8         6
4                      9        9        13             5                 8        10
5                     10       20        11            16                15        14
6                     12       15        11            16                13        14
7                     21       15        16            12                18        21
8                     22       15        19            23                18        15
9                     16       13        25            16                18        16
10                    23       31        27            33                33        36
D10/D1              2.56     6.20      9.00          8.25             12.00     18.00
Q5/Q1               2.60     3.38      6.50          6.12              7.29      8.67
D10 - D1              14       26        24            29                30        34
χ²†                 9.17     9.89      6.54         11.34              5.90     10.30
Likelihood ratio   13.89    26.92     37.26         44.09             47.89     55.22

Note: (1) Systolic blood pressure, (2) Serum cholesterol, (3) Cigarette smoking, (4) Glucose intolerance, (5) Heart enlargement by X-ray, (6) LVH by electrocardiogram, (7) Serum albumin.
*Persons are ranked according to their probability of developing CHD in 2 yr conditional on the specified variables as estimated from the experience of the Framingham Study during the first 20 yr of followup using a logistic function. Parameters estimated by the method of Walker and Duncan. Q5 is the highest quintile of risk, Q1 the lowest.
†From the actual and estimated number of cases in each decile of risk.

The formal derivation of upper bounds for the average proportion of variance explained is given in the appendix. It is demonstrated in the appendix [in the discussion leading to formula (25)] that under the assumptions that $p$ is distributed according to a continuous density $f(p)$ on $0 \le p \le 1$, that $f(p)$ is unimodal (i.e. has a single peak), and that $\bar{p} < 1/2$, the upper bound for the average proportion of variance explained for fixed $p_m$ and $\bar{p}$ is

$$\frac{\bar{p}\,(2/3 - \bar{p}) - (p_m/3)(1 - 2\bar{p})}{\bar{p}\,(1 - \bar{p})}.$$

The maximum over all values of $p_m$ is attained when $p_m = 0$ and will be

$$\frac{2/3 - \bar{p}}{1 - \bar{p}}.$$

In the examples given in this paper, $\bar{p} = 0.02$, so that the theoretical maximum is 66%. However, $p_m$ is unlikely to equal zero, so this is a somewhat excessive bound. On the other hand, since $p_m < \bar{p}$, the upper bound cannot be less than 33%.

Assuming such a large class of admissible distributions has, not surprisingly, led to a very large upper bound. The bound can be dramatically lowered for special cases. Morrison has pointed out that $f(p)$ for such situations as we are interested in may often be quite well represented by a beta function [18]. Under reasonable assumptions [explained in the appendix in the discussion preceding formula (11)] $f(p)$ will be unimodal and the average proportion of variance explained for a given $p_m$ and $\bar{p}$ will be

$$\frac{\bar{p} - p_m}{1 - 3p_m + \bar{p}}.$$

In this case the maximum over all values of $p_m$ occurs at $p_m = 0$ and will be $\bar{p}/(1 + \bar{p})$. When $\bar{p} = 0.02$ this maximum will be 1.96%. Clearly, then, the upper bound for the proportion of variance explained depends on the assumptions made respecting $f(p)$.

The average proportion actually explained may be estimated by $[\bar{p}\bar{q} - (\sum p_i q_i / n)] / \bar{p}\bar{q}$, where $\bar{q}$ is $1 - \bar{p}$ and the $p_i$ and $q_i = (1 - p_i)$ are conditional estimates of risk for each person. The average percentage of variance explained by the various risk functions given in Table 1 ranges from 0.11% (for systolic blood pressure) to 1.18% when all seven variables are entered. Whether 1.18% is good or terrible depends on whether the maximum attainable given $\bar{p} = 0.02$ is taken as 1.96, 33 or 66%. Since there is no compelling argument for one or the other or some intermediate value, it is difficult to see how the average percentage of variance explained by a specific risk function is likely to assist us in evaluating that function. The hazards of arguing analogically from the multivariate normal case were never more clearly illustrated than in this instance.
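The three reference values quoted above (66%, 33% and 1.96% for a mean risk of 0.02) can be checked numerically. The sketch below uses the bounds as read from the text and the estimator described in the preceding paragraph; the function names are introduced here for illustration only.

```python
def general_upper_bound(p_bar, p_mode):
    """Bound for a unimodal continuous density with mean p_bar < 1/2 and mode p_mode."""
    num = p_bar * (2.0 / 3.0 - p_bar) - (p_mode / 3.0) * (1.0 - 2.0 * p_bar)
    return num / (p_bar * (1.0 - p_bar))

def beta_value(p_bar, p_mode):
    """Average proportion of variance explained when f(p) is a beta density."""
    return (p_bar - p_mode) / (1.0 - 3.0 * p_mode + p_bar)

def proportion_explained(risks):
    """Sample estimate [p q - sum(p_i q_i)/n] / (p q) from individual estimated risks."""
    n = len(risks)
    p_bar = sum(risks) / n
    within = sum(p * (1.0 - p) for p in risks) / n
    return (p_bar * (1.0 - p_bar) - within) / (p_bar * (1.0 - p_bar))

P_BAR = 0.02
print(round(100 * general_upper_bound(P_BAR, 0.0), 1))    # mode at zero: 66.0
print(round(100 * general_upper_bound(P_BAR, P_BAR), 1))  # mode equal to mean: 33.3
print(round(100 * beta_value(P_BAR, 0.0), 2))             # beta case, mode at zero: 1.96
```

The estimator returns zero when every person is assigned the same risk and one in the degenerate case where each estimated risk is exactly 0 or 1, which brackets the values discussed in the text.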

OTHER RISK FACTORS

The risk factors chosen include the ones commonly identified as the key characteristics upon which one could intervene as well as some adjunct characteristics. Others, of course, are known or suspected. Rather than total cholesterol, it is now recognized that a split into the HDL and LDL components considerably sharpens the lipid characterization [11, 12]. Electrocardiographic abnormalities other than LVH are identified as CHD risk factors [15]. The presence of stroke or peripheral vascular disease is known to betoken increased CHD risk [19]. In women, the menopause marks a doubling of CHD risk [20]. Furthermore, the specific manifestation of CHD must sometimes be considered. For example, cigarette smoking is a risk factor for 'heart attacks' but probably not for angina pectoris [15]. The impact of some risk factors varies by sex. Glucose intolerance is a particularly strong risk factor for CHD death in women, less so for men [21]. This article does not attempt to explore these details. Clearly there is evidence now available which can be used to increase the specificity of CHD prediction by age, sex and manifestation of CHD.

Is there more to learn? Certainly. Leaving aside new risk factors, it is likely that currently identified risk factors will themselves yield additional information on more careful study or when employed more rationally. Consider, for a moment, the fact that a single blood pressure determination is by itself predictive of CHD. If it be granted that CHD arises from a long-term process, then it must be that a single measurement of blood pressure reflects that process only approximately. We know that a more reliable assessment of blood pressure, based on the average of a number of measurements in a short period of time, leads to improved prediction [22]. The fact that the pathological consequences of blood pressure are in any way predicted by one measurement suggests either that the influence is very powerful or that a single measurement is highly correlated with all other measurements of blood pressure, or both.

Two groups with the same average blood pressures at one point in time would presumably have different CHD risks if they had previously had vastly different average blood pressures at another point in time. In comparing different populations this must always be regarded as a possibility for all risk factors. In principle, of course, we should be able to improve prediction considerably if the lifetime configuration of risk characteristics up to the beginning of followup were known. In practice, however, we do not have the lifetime configuration nor do we have a method of using the information if it were available. Thus, it must be anticipated that we will never be able to entirely explain population differences in terms of a CHD risk function.

PREDICTION

What has been discussed up to this point is the relationship of specified characteristics in persons free of CHD to the subsequent development of the disease, given the knowledge both of the values for the baseline characteristics and the outcome events. The logistic parameters so estimated are, therefore, intended to fit the specific data. This is description rather than prediction. On the other hand, we would hardly be interested in such estimates if we did not expect that they would apply reasonably well to other experience. The sampling errors estimated along with these parameters are one gauge of the confidence we can place in the estimates. Another is our experience with actual prediction. By using parameters estimated from one set of Framingham data to predict who will develop disease in another part of the Study experience, we have found that we do as well in 'predicting' as in describing.

The following example may serve as an illustration: A multiple logistic function for 2-yr CHD incidence given the variables systolic blood pressure, serum cholesterol, cigarette smoking and glucose intolerance was calculated for men 45-54 at exam from the first 10 yr experience. The estimated parameters for this logistic were then applied to the next 10 yr experience to predict the occurrence of CHD among men 45-54 at exam during the second 10 yr. The results are given in Table 2. Agreement between actual (observed) and expected (estimated) numbers of cases by decile of risk was as good in the predicted series as in the fitted series of cases and there was only a small degradation of the risk slope. The observed Q5/Q1 was 8.0 for the first 10 yr (fitting) and 5.8 for the second 10 yr (predicting). The expected ratios for the first and second 10 yr were 6.8 and 5.5, respectively. The only respect in which prediction faltered was in the total number of cases estimated to occur. It was predicted that 78 cases would occur and only 70 did.
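The agreement between actual and expected cases by decile can be checked with the usual goodness-of-fit sum. The sketch below uses the decile figures as read from Table 2; one fitted value that is hard to read in the table is taken here as 3.79, an assumption chosen because it makes the fitted expected total equal the 68 observed cases.

```python
def chi_square(expected, actual):
    """Sum over risk groups of (actual - expected)^2 / expected."""
    return sum((a - e) ** 2 / e for e, a in zip(expected, actual))

# Expected and actual cases by decile of risk, as read from Table 2.
# The third fitted expected value is assumed to be 3.79 (see lead-in).
fitted_expected = [2.39, 3.21, 3.79, 4.42, 5.09, 5.95, 6.94, 8.10, 10.15, 17.96]
fitted_actual = [2, 1, 2, 2, 9, 9, 9, 10, 4, 20]
predicted_expected = [2.61, 3.44, 4.16, 4.86, 5.65, 6.59, 7.75, 9.40, 12.10, 21.08]
predicted_actual = [1, 3, 6, 6, 6, 4, 6, 15, 11, 12]

chi2_fitted = chi_square(fitted_expected, fitted_actual)
chi2_predicted = chi_square(predicted_expected, predicted_actual)
total_expected = sum(predicted_expected)
total_actual = sum(predicted_actual)
```

On these figures the statistic is about 13.3 for the fitted period and about 10.9 for the predicted period, and the expected total for the second period is about 77.6 cases against 70 observed.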
This discrepancy, however, is probably well within reasonable chance variation. Another test of predictive efficacy is the following: Using parameters of the logistic function estimated from the first 10yr experience for men 45-54 (the same estimates

435

Predictability of Coronary Heart Disease TABLE 2. DISTRIBUTIONOF CHD BY LEVELOF RISK IN TWO

Decile Of

risk

PERIODS:FRAMINGHAMSTUDY, MEN 45-54*

Number of CHD cases in second 10 yr (predicted) Expected Actual

Number of CHD cases in first 1Oyr (fitted) Expected Actual

1 2 3 4 5 6 7 8 9 10 DroiD,

2 1 2 2 9 9 9 10 4 20 10 8

QslQl I2 D 10 -

lo-yr

2.39 3.21 3.19 4.42 5.09 5.95 6.94 8.10 10.15 17.96 7.5 5

1 3 6 6 6 4 6 15 11 12 12 5.8

15.6

11

2.61 3.44 4.16 4.86 5.65 6.59 7.75 9.40 12.10 21.08 8.1 5.5

13.34 D,

18

10.91 18.5

*See notes to Table 1. The variables used in the logistic are (l)(7).

used in the previous example) a 2yr probability of CHD for each man 45-54 taking Exam 6 was calculated, given his characteristics at Exam 6. Men were then ranked according to this probability, from low to high risk, and divided into 10 equal-sized groups. The question can then be asked: What was the CHD experience for each of the next five 2-yr intervals in each of these risk groups? The ratio of the actual number of cases in Qs/Q1 (the quintiles being defined on the basis of the first five examinations) were as follows: Exam 62.0, Exam 7-11.0, Exam 8-9.0, Exam 9--8.0, Exam 10-4.0 (Table 3). Since there were relatively few cases in each 2-yr interval the sampling variability is quite high. Nonetheless it is clear that the contrast in CHD incidence between high and low risk groups persisted well beyond the 2 yr that followed the examination at which these men and their risk function were characterized. We would conclude from these examples, first, that the risk that CHD will occur in a specified, fixed time span can truly be predicted from a standard risk profile derived from characteristics measured before the event. The second conclusion is that not only is the prediction verifiable over the specified time span but the contrast in CHD incidence between persons designated as at high risk and those designated at low risk persists well beyond the limited time span for which the prediction was intended. The preceding examples refer to prediction within the Framingham Study. The Framingham risk function estimates have also been used to predict heart attack experience TABLE 3. DISTRIFKJTION OF CHD BY LEVEL OF RISK IN SUCCESSIVE BIENNIALINTERVALS:FRAMINGHAMSTUDY, MEN 45-54 AT EXALI 6

(1st) Decile of risk* 1 2 3 4 5 6 7 8 9 10

&IQ,

0 2 0 2 1 1 2 4 3 1 2

Biennium (2nd) (3rd) (4th) Number of CHD cases 1 0 0 1 0

1 3 4 3 8 11

0

1 I 2 2 2 0 1 5 4 9

0 1 2 2 1 0 4 4 1 1

8

(5th) 2 1 2 1 3 3 4 2 7 5 4

*Based on logistic parameters estimated for first 10yr (see Table 2) applied to characteristics measured at the beginning of the second 10 yr. Assignment to decile of risk does not change after first 2 yr.


in other populations of middle-aged white men in the continental United States [23]. In such populations prediction has also proved highly effective. From the Framingham experience it was possible to identify high and low risk groups and predict their actual CHD incidence rates with considerable accuracy.

On the other hand we do not find this to be equally true when the geographic area is enlarged. Framingham experience for white middle-aged men predicts twice as many heart attacks as are actually occurring among men in Puerto Rico and among Japanese men in Honolulu [24] and three times as many as are observed in Yugoslav urban men [25]. True, the Framingham experience does a good job of distinguishing between men at high risk and men at low risk within each of these populations, but the absolute level of risk is not well predicted. Similar observations have emerged from the Seven Countries Study [26]. Also, in predicting CHD for women from risk functions for men it has been observed that the actual and predicted levels differ substantially [27].

These facts indicate that our ability to predict CHD has room for improvement. On a priori grounds this is to be expected even if we had identified all the pertinent risk factors for CHD. In populations with similar life histories each of the risk factors characterized at one point in time provides a reasonable index to levels for that characteristic at other points in time. For populations with different life histories this is not necessarily true. Thus, a cholesterol value obtained in Puerto Rico or Hawaii at one point in time may not be comparable to that obtained in Framingham if the former populations have been experiencing a rapid rise in their lipid values.

CONCLUSION

In the past 30 yr there has been an impressive increase in our ability to predict coronary heart disease. This has arisen both from new epidemiological research and from the development of new statistical techniques. Some of the issues arising from these new developments have been discussed. While the discussion has revolved around the predictability of coronary heart disease, it is clear that the same issues arise in predicting other chronic diseases. The problems must, inevitably, be phrased in the context of current analytical procedures and in terms of our current conceptualization of the risk factors and the disease. It is likely, however, that our conclusions will continue to apply with minor modifications so long as the disease itself is viewed as a discrete event.

Our formulation of the issues is probabilistic in nature. We have argued that if predictability is viewed in such terms rather than deterministically, it is not reasonable to anticipate complete predictability nor even to specify a convincing upper bound to predictability in any particular instance. It is possible, however, to determine in less absolute terms how well a risk function performs and to compare the effectiveness of one risk function with another. This paper provides some guides for doing this.

Acknowledgements. We wish to record our thanks to Nathan Mantel and Dr. John Hyde for some useful discussions.

REFERENCES

1. Cornfield J, Gordon T, Smith WS: Quantal response curves for experimentally uncontrolled variables. Bull ISI 38 (3): 97-115, 1961
2. Enos WF Jr, Holmes RH, Beyer JC: Coronary disease among US soldiers killed in action in Korea; preliminary report. JAMA 152: 1090-1093, 1953
3. Mason JK: Asymptomatic disease of the coronary arteries in young men. Brit Med J 2: 1234-1237, 1963
4. Rigal RD, Lovell FW, Townsend FM: Pathological findings in the cardiovascular systems of military flying personnel. Amer J Card 6: 19-25, 1960
5. Kannel WB, McNamara PM, Feinleib M, Dawber TR: The unrecognized myocardial infarction: 14 year follow-up experience in the Framingham Study. Geriatrics 25: 75-87, 1970
6. Roseman MD: Painless MI: review of literature and analysis of 220 cases. Ann Int Med 41: 1-8, 1954
7. Johnson WJ, Achor RWP, Burchell HB, Edwards JE: Unrecognized MI. Arch Int Med 103: 253-261, 1959
8. Melichar F, Jedlicka V, Havlik L: A study of undiagnosed MIs. Acta Med Scand 174: 761-768, 1963
9. Paton BC: The accuracy of diagnosis of MI. Amer J Med 23: 761-768, 1957
10. Levine HD, Phillips E: An appraisal of the newer electrocardiography: correlations in 150 consecutive autopsied cases. New Eng J Med 245: 833-842, 1951
11. Miller GJ, Miller NE: Plasma-high-density-lipoprotein concentration and development of ischaemic heart-disease. Lancet 1: 16-17, 1975
12. Gordon T, Castelli WP, Hjortland MC, Kannel WB: The prediction of coronary heart disease by high-density and other lipoproteins: an historical perspective. In: Hyperlipidemia: Diagnosis and Therapy. Rifkind BM, Levy RI (Eds). New York: Grune and Stratton, 1977, pp. 71-78
13. Walker SH, Duncan DB: Estimation of the probability of an event as a function of several independent variables. Biometrika 54: 167-179, 1967
14. Gordon T, Kannel WB: The prospective study of cardiovascular disease. In: Stewart GT (Ed): Trends in Epidemiology: Applications to Health Service Research and Training. Springfield, Illinois: Charles C. Thomas, 1972, pp. 189-211
15. Shurtleff D: Some characteristics related to the incidence of cardiovascular disease and death: Framingham Study, 18-year followup. In: The Framingham Study. Kannel WB, Gordon T (Eds). DHEW Pub. No. (NIH) 74-599. Washington DC: U.S. Govt. Printing Office, 1974
16. Menotti A, Capocaccia R, Conti S, et al.: Identifying subsets of major risk factors in multivariate estimation of coronary risk. J Chron Dis 30: 557-565, 1977
17. Efron B: Regression and ANOVA with Zero-One data: Measures of residual variation. JASA 73: 113-121, 1978
18. Morrison DG: Upper bounds of correlations between binary outcomes and probabilistic predictions. JASA 67: 68-70, 1972
19. Gordon T, Sorlie P, Kannel WB: Coronary heart disease, atherothrombotic brain infarction, intermittent claudication: a multivariate analysis of some factors related to their incidence. In: The Framingham Study. Kannel WB, Gordon T (Eds). 426: 130/1345. Washington DC: U.S. Govt. Printing Office, 1971
20. Gordon T, Kannel WB, Hjortland MC, McNamara PM: Menopause and coronary heart disease; The Framingham Study. Ann Int Med 89: 157-161, 1978
21. Garcia MJ, McNamara PM, Gordon T, Kannel WB: Morbidity and mortality in diabetics in the Framingham population. Sixteen year follow-up study. Diabetes 23: 105-111, 1974
22. Gordon T, Sorlie P, Kannel WB: Problems in the assessment of blood pressure: The Framingham Study. Int J Epid 5: 327-334, 1976
23. McGee D, Gordon T: The results of the Framingham Study applied to four other US-based epidemiologic studies of cardiovascular disease. In: The Framingham Study. Kannel WB, Gordon T (Eds). DHEW (NIH) 76-1083. Washington DC: U.S. Govt. Printing Office, 1976
24. Gordon T, Garcia-Palmieri MR, Kagan A, Kannel WB, Schiffman J: Differences in coronary heart disease in Framingham, Honolulu and Puerto Rico. J Chron Dis 27: 329-344, 1974
25. Kozarevic D, Pirc B, Racic Z, Dawber TR, Gordon T, Zukel WJ: The Yugoslavia Cardiovascular Disease Study-2. Factors in the incidence of coronary heart disease. Amer J Epid 104: 133-140, 1976
26. Keys A, Aravanis C, Blackburn H, van Buchem FSP, Buzina R, Djordjevic BS, Fidanza F, Karvonen MJ, Menotti A, Puddu V, Taylor HL: Probability of middle-aged men developing coronary heart disease in five years. Circulation 45: 815-828, 1972
27. Gordon T: Coronary heart disease in young women: Incidence and epidemiology. In: Coronary Heart Disease in Young Women. Oliver MF (Ed). Edinburgh: Churchill Livingstone, 1978, pp. 12-23
28. Kimball AW: On dependent tests of significance in the analysis of variance. Ann Math Stat 22: 600-602, 1951

APPENDIX

Average Proportion of Variance Explained (APVE) when responses are dichotomous

To clarify ideas, it is useful to make an analogy with multivariate normal regression theory. In multivariate normal theory, one assumes random sampling of, say, $(y, x_1, \ldots, x_p)$ from a multivariate normal population with mean vector $(\mu_0, \mu')$ and variance-covariance matrix $\Sigma$, say. It is then well known that the conditional expectation of $y$ given $x_1, \ldots, x_p$ is

$$E(y \mid x_1, \ldots, x_p) = \beta_0 + \sum_{i=1}^{p} \beta_i x_i \tag{1}$$

where $\beta_0$ is a function of $(\mu_0, \mu')$ and the elements of $\Sigma$ while the $\beta_i$ are functions only of the elements of $\Sigma$. Furthermore, the conditional distribution of $y$ is univariate normal with mean $E(y \mid x_1, \ldots, x_p)$ and variance $\sigma_y^2(1 - R^2)$ where $\sigma_y^2$ is the unconditional variance of $y$ and $R$ is the multiple correlation coefficient between $y$ and $(x_1, \ldots, x_p)$. $R^2$ is frequently referred to as the proportion of variance (of $y$) explained by $x_1, \ldots, x_p$. More generally we may define the average proportion of variance explained (APVE) by

$$\mathrm{APVE} = \frac{\text{unconditional variance of } y \; - \; \text{average conditional variance of } y}{\text{unconditional variance of } y} \tag{2}$$

In multivariate normal theory, the modifier 'average' is not necessary since the conditional variance of $y$ is constant. In the present context, a basic assumption is still that $(y, x_1, x_2, \ldots, x_p)$ is a random sample from some distribution. In addition, it is assumed that $y$ takes on the values zero or one and that

$$E(y \mid x_1, x_2, \ldots, x_p) = P(y = 1 \mid x) = p(x) \tag{3}$$

where $p(x)$ may be a non-linear function of $x_1, x_2, \ldots, x_p$ such as the logistic function. Moreover, the conditional variance of $y$ is

$$E\{[y - p(x)]^2 \mid x\} = p(x)q(x) \tag{4}$$

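where $q(x) = 1 - p(x)$. The unconditional mean of $y$ is $E\,p(x) = \bar p$, say, while the unconditional variance is

$$\begin{aligned}
E(y - \bar p)^2 &= E\{[y - p(x)] + [p(x) - \bar p]\}^2 \\
&= E[y - p(x)]^2 + E[p(x) - \bar p]^2 \\
&= E\,p(x)q(x) + E\,p^2(x) - \bar p^2 \\
&= \bar p \bar q,
\end{aligned} \tag{5}$$

with $\bar q = 1 - \bar p$. Thus the average proportion of the variance of $y$ explained by $x_1, x_2, \ldots, x_p$ is given by

$$\mathrm{APVE} = \frac{\bar p \bar q - E\,p(x)q(x)}{\bar p \bar q}. \tag{6}$$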
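As a concrete reading of the formula above, the Python sketch below evaluates APVE for a sample of individual probabilities, replacing the expectations by sample averages. The function name `apve` and the ten risk values are illustrative assumptions of ours, not Framingham data.

```python
def apve(risks):
    """Average proportion of variance explained for a sample of
    individual event probabilities p(x):
    (p_bar*q_bar - mean of p*q) / (p_bar*q_bar)."""
    n = len(risks)
    p_bar = sum(risks) / n
    q_bar = 1.0 - p_bar
    mean_pq = sum(p * (1.0 - p) for p in risks) / n
    return (p_bar * q_bar - mean_pq) / (p_bar * q_bar)


# Hypothetical interval risks for ten men (illustrative values only).
risks = [0.01, 0.01, 0.02, 0.02, 0.03, 0.04, 0.05, 0.07, 0.10, 0.15]
print(round(apve(risks), 3))

# Every risk is zero or one: all of the variance is explained.
print(apve([0.0, 0.0, 1.0]))
```

Even though the hypothetical risks span a fifteen-fold range, the computed APVE is only a few percent, which illustrates why low interval rates keep this measure small.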

It seems unlikely from (6) that all of the variance can be explained. There are situations, however, when the term $E\,p(x)q(x)$ in (6) is zero, so that APVE equals one. One of these is when, even though there is a non-degenerate distribution of $x$, $p(x)$ is either zero or one, indifferently of $x$. The only other situation of this kind appears to be when $p(x)$ is one for $x = x_0$ and $p(x) = 0$ otherwise. Neither of the extreme situations seems likely to occur in the real world. In particular, with respect to coronary heart disease outcomes, one expects interval rates to be on the average quite modest, perhaps in the range of 1-15%. Further, one anticipates relatively few individuals with extremely high interval risks. These comments suggest that we anticipate a right-skewed distribution on the interval (0, 1), with a very long tail. It may be reasonable as well to anticipate that the distribution of risk is uni-modal. These expectations can and should be checked against data in any particular case.

In any event, it is of interest to ask to what extent the average proportion of variance explained is limited by assumptions of the type described above. It is desirable to investigate this issue for fairly general probability densities of the variable, $p$, but first we will be more specific and assume the probability density of $p$ is a Beta function density given by

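$$f(p) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha - 1}(1 - p)^{\beta - 1}, \qquad 0 \le p \le 1. \tag{7}$$

For this density the mean of $p$ is $\bar p = \alpha/(\alpha + \beta)$ and, from (6), the average proportion of variance explained is

$$\mathrm{APVE} = \frac{1}{\alpha + \beta + 1}. \tag{9}$$

For $\alpha, \beta > 1$, it has already been noted that (7) is uni-modal. The modal value of $p$, $p_m$ say, is easily obtained by maximizing (7) with respect to $p$. One finds

$$p_m = \frac{\alpha - 1}{\alpha + \beta - 2}. \tag{10}$$

Transforming from $\alpha$ and $\beta$ to $\bar p$ and $p_m$, we calculate (9) to become

$$\frac{\bar p - p_m}{1 - 3p_m + \bar p} \tag{11}$$

while the derivative of (11) with respect to $p_m$ is

$$\frac{-(1 - 2\bar p)}{(1 - 3p_m + \bar p)^2}. \tag{12}$$

From (12), it follows that for given $\bar p < 1/2$, the maximum average proportion of variance explained occurs for $p_m = 0$; i.e. it is given by

$$\frac{\bar p}{1 + \bar p}. \tag{13}$$

On the other hand, if $\bar p > 1/2$, it follows that the maximum average proportion of variance explained occurs for $p_m = 1$ and is given by

$$\frac{1 - \bar p}{2 - \bar p}. \tag{14}$$

Finally, if $\bar p = 1/2$, the average proportion of variance explained is $1/3$ independently of $p_m$ (as may be computed directly from (11), (13), or (14) with $\bar p = 1/2$) and is the maximum average proportion of variance explainable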


for any distribution of $p$ well described by a Beta distribution with parameters $\alpha$, $\beta$ both greater than 1. Equation (11) is, of course, of interest in itself.

It seems at least reasonable that a Beta distribution such as described is a good approximation to many uni-modal continuous (or discrete) distributions of $p$. But it is unlikely that the result assuming a Beta distribution of $p$ will hold in every instance, especially for discrete distributions involving very few values. As an illustration, suppose $p$ assumes values $p_0$ and $p_1$ ($0 \le p_0 < p_1 \le 1$) with probabilities $\pi$ and $(1 - \pi)$ respectively and that $\pi p_0 + (1 - \pi)p_1 = \bar p$. Elementary calculations show that, for this case,

$$\mathrm{APVE} = \frac{(\bar p - p_0)(p_1 - \bar p)}{\bar p \bar q} \tag{15}$$

and that, furthermore, the maximum of APVE for $p_0 \le \bar p \le p_1$ is simply

$$\left(\sqrt{p_1 q_0} - \sqrt{p_0 q_1}\right)^2, \tag{16}$$

where $q_0 = 1 - p_0$ and $q_1 = 1 - p_1$. If one supposes $p_0 = 0$, APVE is, at most, $p_1$; if $p_1 = 1$, APVE is at most $q_0$. Thus, in this extreme case, APVE can, in principle, be quite large. It would be of interest to calculate max APVE for a discrete unimodal distribution of $p$ with $k$ ($\ge 3$) classes, but we have been unable to do this.

We can, however, as implied earlier, get results similar to (11), (13) and (14) under the sole assumption that $p$ is distributed according to a continuous density $f(p)$ on $0 \le p \le 1$, and that $f(p)$ is unimodal. To derive these results we first write (6) as

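$$\mathrm{APVE} = \frac{1}{\bar p \bar q}\int_0^1 (p - \bar p)^2 f(p)\,dp \tag{17}$$

and notice we can write (17) as

$$\mathrm{APVE} = \frac{1}{\bar p \bar q}\left[\int_0^1 (p - p_m)^2 f(p)\,dp - (\bar p - p_m)^2\right] \tag{18}$$

where $p_m$ is the value of $p$ for which $f(p)$ achieves its single mode. Concentrating only on the integral in (18), we can write it as

$$\int_0^{p_m} (p - p_m)^2 f(p)\,dp + \int_{p_m}^{1} (p - p_m)^2 f(p)\,dp. \tag{19}$$

Now we call on the following

Theorem. Let $X$ be a random variable with probability density $g(X)$, $0 \le X \le 1$, and let $h_1(X) \ge 0$ [$h_2(X) \ge 0$] be strictly monotone decreasing [strictly monotone increasing] with finite expectations. Then

$$E\,h_1(X)h_2(X) < E\,h_1(X)\,E\,h_2(X). \tag{20}$$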

This theorem is an easy generalization of a theorem appearing in Kimball [28]. Considering the two integrals in (19) separately, we first observe that in $(0, p_m)$, $(p - p_m)^2$ is monotone decreasing and non-negative while $f(p)$ is non-negative and monotone increasing. Taking $g(p)$ to be the uniform density on $(0, p_m)$, we have by the theorem

$$\frac{1}{p_m}\int_0^{p_m} (p - p_m)^2 f(p)\,dp < \left[\frac{1}{p_m}\int_0^{p_m} (p - p_m)^2\,dp\right]\left[\frac{1}{p_m}\int_0^{p_m} f(p)\,dp\right]$$

or

$$\int_0^{p_m} (p - p_m)^2 f(p)\,dp < \frac{p_m^2}{3}\,F(p_m) \tag{21}$$

where $F(p_m)$ is the cumulative distribution function of $p$, evaluated at $p = p_m$. By a similar argument

$$\int_{p_m}^{1} (p - p_m)^2 f(p)\,dp < \frac{(1 - p_m)^2}{3}\,[1 - F(p_m)]. \tag{22}$$

One can then write an upper bound for (18) as

$$\frac{1}{\bar p \bar q}\left[\frac{p_m^2}{3}\,F(p_m) + \frac{(1 - p_m)^2}{3}\,[1 - F(p_m)] - (\bar p - p_m)^2\right]. \tag{23}$$
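The inequality behind this upper bound can be checked numerically for a particular unimodal density. The Python sketch below does so for a Beta density, comparing the exact APVE (variance of $p$ about the mode, corrected for the offset of the mean, divided by $\bar p \bar q$) with the bound built from $F(p_m)$. The helper names, the midpoint-rule integrator and the choice $\alpha = 2$, $\beta = 8$ are our assumptions for illustration.

```python
import math


def beta_pdf(p, a, b):
    """Beta density with parameters a and b, evaluated at 0 < p < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))


def integrate(func, lo, hi, steps=20000):
    """Midpoint-rule approximation to the integral of func over (lo, hi)."""
    width = (hi - lo) / steps
    return width * sum(func(lo + (i + 0.5) * width) for i in range(steps))


def check_bound(a, b):
    """Return (exact APVE, upper bound) for a Beta(a, b) density, a, b > 1."""
    p_bar = a / (a + b)              # mean of the Beta density
    p_m = (a - 1) / (a + b - 2)      # mode of the Beta density
    pq = p_bar * (1 - p_bar)

    def f(p):
        return beta_pdf(p, a, b)

    def about_mode(p):
        return (p - p_m) ** 2 * f(p)

    cdf_at_mode = integrate(f, 0.0, p_m)
    offset = (p_bar - p_m) ** 2
    exact = (integrate(about_mode, 0.0, 1.0) - offset) / pq
    bound = (p_m ** 2 / 3 * cdf_at_mode
             + (1 - p_m) ** 2 / 3 * (1 - cdf_at_mode)
             - offset) / pq
    return exact, bound


exact, bound = check_bound(2.0, 8.0)
print(exact, bound)
```

For this choice the exact value agrees with $1/(\alpha + \beta + 1)$ and lies well below the bound, which is loose until the constraint on the mean is imposed.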

The bound (23) does not take into account that we are implicitly fixing the mean, $\bar p$. To take this into account, we observe that the bounding distribution implied by the application of (20) is a mixture of two uniform distributions, one with weight $F(p_m)$ on $(0, p_m)$, the other with weight $1 - F(p_m)$ on $(p_m, 1)$. The requirement that the mean is fixed at $\bar p$ thus leads to

$$F(p_m) = 1 - 2\bar p + p_m. \tag{24}$$

An analysis of (24) based on the requirement $0 \le F(p_m) \le 1$ leads to the requirements:

(a) if $\bar p \le 1/2$, $0 \le p_m \le 2\bar p$;
(b) if $\bar p > 1/2$, $2\bar p - 1 \le p_m \le 1$.

Substitution of (24) into (23) leads after some simplification to

$$\frac{\bar p(2 - 3\bar p) - p_m(1 - 2\bar p)}{3\bar p(1 - \bar p)} \tag{25}$$



as a bound for APVE. We observe that (25) is monotone decreasing in $p_m$ for $\bar p < 1/2$, monotone increasing in $p_m$ for $\bar p > 1/2$ and takes the value $1/3$ for $\bar p = 1/2$. It follows immediately, if $\bar p < 1/2$, that the maximum of (25) for variation in $p_m$ is given by

$$\frac{2 - 3\bar p}{3(1 - \bar p)} \tag{26a}$$

while for $\bar p > 1/2$, (25) assumes a maximum at $p_m = 1$. A little re-arrangement shows this maximum is given by

$$\frac{2 - 3(1 - \bar p)}{3[1 - (1 - \bar p)]}. \tag{26b}$$

It is also of interest to note that for $\bar p < 1/2$, the minimum of (25) [the min-max] is at $p_m = 2\bar p$ and is

$$\frac{\bar p}{3(1 - \bar p)}. \tag{27a}$$

Similarly, for $\bar p > 1/2$, the min-max occurs when $p_m = 2\bar p - 1$ and is given by

$$\frac{1 - \bar p}{3\bar p}. \tag{27b}$$

We observe that, for $\bar p < 1/2$, (27a), the min-max, will be less than (13). This underlines the importance of knowledge of $p_m$.
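To give a feel for the magnitudes involved, the Python sketch below tabulates the Beta-density result, $(\bar p - p_m)/(1 - 3p_m + \bar p)$, against the general unimodal bound, $[\bar p(2 - 3\bar p) - p_m(1 - 2\bar p)]/[3\bar p(1 - \bar p)]$, at the extreme modes discussed above. The function names and the three mean risks chosen are illustrative assumptions of ours.

```python
def beta_apve(p_bar, p_m):
    """APVE for a Beta density with mean p_bar and mode p_m (equation 11)."""
    return (p_bar - p_m) / (1 - 3 * p_m + p_bar)


def unimodal_bound(p_bar, p_m):
    """Bound on APVE for a unimodal density with mean p_bar and
    mode p_m (equation 25)."""
    numerator = p_bar * (2 - 3 * p_bar) - p_m * (1 - 2 * p_bar)
    return numerator / (3 * p_bar * (1 - p_bar))


# Mean interval risks below 1/2, as expected for CHD.
for p_bar in (0.05, 0.10, 0.25):
    beta_max = beta_apve(p_bar, 0.0)             # maximum for a Beta density
    max_bound = unimodal_bound(p_bar, 0.0)       # largest general bound
    min_max = unimodal_bound(p_bar, 2 * p_bar)   # smallest general bound
    print(p_bar, round(beta_max, 3), round(max_bound, 3), round(min_max, 3))
```

The spread between the largest and smallest general bound at a given mean is wide, so the location of the mode matters a great deal for what the bound can say.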
