Abstract: In the past 30 yr there has been an impressive increase in our ability to predict coronary heart disease (CHD). Some of the developments in epidemiology and statistics from which this derives are discussed. While it is not possible to specify a useful upper bound to the predictability of coronary heart disease, the paper provides guides to determining the effectiveness of various CHD risk functions, with illustrative examples from the Framingham Study, and discusses interpretative problems.

DURING the last 30 yr a number of risk factors for coronary heart disease (CHD) have been confirmed, their relationship to CHD incidence in U.S. populations has been quantified, and efficient procedures for using the composite predictive information from these factors have been developed and tested. Of the core triad of CHD risk factors (blood pressure, serum cholesterol and cigarette smoking), only blood pressure was firmly established 30 yr ago. Given the strides that cardiovascular epidemiology has made since, it is appropriate to ask at this time how well we can actually predict the appearance of coronary heart disease on the basis of currently known risk factors.

DEFINING THE PROBLEM

Prediction can be defined in statistical terms, inter alia, as a problem in classification, hazard, or dose-response. While the latter two seem more reasonable than the first, as will appear from our discussion, the interconnections among all of them are so intimate that their separation must be attempted with some delicacy.

Classification

According to the classification model, a person belongs either to one population (CHD cases) or another (non-CHD cases). One way of approaching this is in terms of discrimination [1]. Each person has a set of specified values for a defined group of risk characteristics. Since the case and noncase populations are conceived as being mixed together at the beginning of followup, the problem is to combine the information from these risk characteristics to identify the population to which each individual will belong at the end of a fixed interval of followup. Perfect discrimination occurs when each and every person is correctly identified as a case or noncase. Clearly this is unrealistic. Once some risk is assumed, even a very low risk, some cases may develop, and if the low-risk class is large enough some cases should be expected. Nor can there be absolute certainty that any person will develop CHD. The efficacy of the estimation should be subject not only to evaluation in terms of the fixed interval of followup but also to the test of experience in the next time intervals. Thus, if groups designated at low risk develop CHD at a low rate in the initial interval and also in subsequent intervals, this constitutes a further validation of the estimation of risk status; and similarly for groups designated to be at high risk.

*Formerly with the National Heart, Lung, and Blood Institute.
†Framingham Heart Disease Epidemiology Study, National Heart, Lung, and Blood Institute.
‡Department of Statistics, Biostatistics Center, George Washington University. Work partially supported by NIH grants HL15191-07 and CA1568GO4.

Hazard

A simplified formulation of hazard is the following: Every person of a given age who is free of CHD will develop CHD sooner or later unless he dies of some other cause first. We are, then, concerned with the distribution of time-to-CHD, a continuous variable whose distribution is conceived to be conditional on certain characteristics or risk factors. This distribution is modified by the competing risk of death from non-CHD causes, which may intervene before CHD occurs. In considering the distribution of time-to-CHD, it is reasonable to define the predictive power of risk factors in terms of the variance of this distribution. A set of risk factors which accounts for a significantly larger proportion of the unconditional variance of time-to-CHD than another set is better. Such a formulation has certain difficulties. In principle, a hazard function encompasses the experience of a full life span, but observations on which to base statistical estimates of such a function over such a range are not presently adequate. Furthermore, neither the distribution of time-to-CHD nor that of time-to-non-CHD-death can be construed as static in the real world. Nonetheless, this formulation is conceptually attractive.

Dose-response

A related approach is to consider the probability of developing CHD in a more limited fixed time span, conditional on a specified set of risk factors. This may be viewed as a dose-response relationship. Persons with low (or occasionally with high) doses of the risk factors are less likely to develop CHD in the specified time period; persons with high (or occasionally with low) doses are more likely to develop CHD. In this formulation, as in the hazard formulation, nobody is certain to either avoid CHD or develop CHD. Rather, the risk is considered to increase with the level of the specified risk characteristics. This is only an approximation of the real world but statistical functions based on this model (and particularly the logistic function) have been repeatedly shown to fit a wide variety of data sets reasonably well. Currently, the most commonly used function for predicting CHD in a fixed time interval, given certain characteristics or risk factors, is the logistic regression. In a regression model the main test of relevance is the goodness of fit. If the regression of CHD incidence in a fixed time interval on a set of risk characteristics is computed, it is possible to rank the estimated probabilities of CHD from lowest to highest, and to divide the population into (say) deciles of risk. The agreement of the actual and expected number of cases in each decile can then be examined to assess the accuracy of the prediction. However, fit is not the only criterion for evaluating a risk function. It is also desirable that the risk slope be steep; that is, that only a small per cent of the cases fall in the lowest decile of risk (the smaller the better) and that a large per cent of cases fall in the highest decile of risk (the larger the better). This can be viewed as a classification test. Moreover, in principle, low risk can approach zero and high risk can approach unity.
If a risk function can identify persons at very low and very high risk this would generally be considered desirable. On the other hand, if a single characteristic were ever discovered which either caused or prevented CHD with near certainty, this characteristic should probably be evaluated separately, rather than included with other factors in a multivariate function. Whether the possibility of near-perfect prediction warrants serious consideration is moot, but in any event a focus on the very high and very low risk groups is entirely too narrow in terms of present knowledge. It does not take into account the groups of intermediate risk that ordinarily account for a large proportion of the CHD incidence. When these groups are considered, it becomes obvious that what is sought is not perfect classification but a function that gives as steep a risk gradient as possible. Before attempting to explore some of these statistical issues in terms of specific data it is well to consider the impact of a number of sources of imprecision in the basic data.
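The decile comparison described under the dose-response formulation can be sketched in a few lines. The example below is only an illustration under stated assumptions: the two standardized risk factors, the intercept and the coefficients are hypothetical, the outcomes are simulated from the same logistic function, and the helper names `logistic_risk` and `decile_table` are introduced here for convenience.

```python
import numpy as np

def logistic_risk(X, intercept, coefs):
    """Probability of an event from a logistic function of the risk factors."""
    z = intercept + X @ np.asarray(coefs, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def decile_table(risk, event, n_groups=10):
    """Rank persons by estimated risk, split them into equal-sized groups and
    return (observed cases, expected cases) per group, lowest risk first."""
    order = np.argsort(risk, kind="stable")
    groups = np.array_split(order, n_groups)
    observed = np.array([event[g].sum() for g in groups], dtype=float)
    expected = np.array([risk[g].sum() for g in groups], dtype=float)
    return observed, expected

# Hypothetical data: two standardized risk factors, made-up coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
risk = logistic_risk(X, intercept=-3.5, coefs=[0.5, 0.4])
event = rng.random(5000) < risk  # outcomes simulated from the same model

observed, expected = decile_table(risk, event)
for d, (o, e) in enumerate(zip(observed, expected), start=1):
    print(f"decile {d:2d}: observed {o:5.0f}  expected {e:6.1f}")
```

Close agreement between the two columns speaks to fit; how sharply the counts rise from the first to the tenth decile speaks to the steepness of the risk gradient.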





Case-finding

The dependent variable under discussion, coronary heart disease, is the end product of a complicated, imperfectly understood process. A major component of the process is the progressive narrowing of the coronary arteries due to atherosclerosis. Autopsy studies indicate that some degree of coronary atherosclerosis is common in middle-aged American men and that this may account for the high incidence of CHD [2-4]. However, at times the narrowing may be quite severe without any clinical manifestation. Alternatively, a clinical manifestation may sometimes occur with only minimal narrowing. Moreover, it is presumed that occasionally a thrombus arising from other processes may block a coronary artery, leading to myocardial ischemia with or without infarction. Arrhythmias may arise from these or other sources, leading to sudden death. Under current rules of attribution such deaths are ordinarily ascribed to coronary heart disease, although this is by necessity inferential. Nor is the diagnosis of clinical CHD certain. It is not uncommon to find evidence of an old myocardial infarction on autopsy that had never surfaced clinically [5-8]. Moreover, routine electrocardiograms will sometimes reveal a recent myocardial infarction in persons who have never evinced suggestive cardiac symptoms and who would not have been discovered except for the routine electrocardiogram [S]. At the same time, the pathognomonic electrocardiographic evidence of myocardial infarction manifest acutely may revert to a nonspecific form within a relatively short time after the event [8, 9, 10].
If we add to these sources of imprecision the inevitable variation in electrocardiographic interpretation, in laboratory tests for enzymes used in diagnosing myocardial infarction, and in the presentation and interpretation of a clinical history, it is fairly clear that a true clinical event may sometimes go undiagnosed and events not due to coronary heart disease may be incorrectly attributed to this cause. While these sources of variation have not been precisely quantified, enough is known of their magnitude to indicate a non-trivial amount of misdiagnosis. Thus, even if we could correctly predict every true case of CHD, some of the cases would be missed and some of the noncases would be called cases, resulting in an apparently imperfect prediction.
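The arithmetic behind this last point can be made concrete. The sketch below assumes a predictor that flags exactly the true cases and compares it with a recorded diagnosis that misses some true cases and mislabels some noncases. The incidence and the two error rates are hypothetical values chosen for illustration, not estimates from any study, and `apparent_accuracy` is a name introduced here.

```python
def apparent_accuracy(incidence, miss_rate, false_pos_rate):
    """Agreement between a predictor that flags exactly the true cases and an
    imperfect recorded diagnosis.

    Returns (share of recorded cases that had been flagged,
             share of recorded noncases that had not been flagged).
    """
    true_case_recorded = incidence * (1 - miss_rate)
    true_case_missed = incidence * miss_rate
    noncase_called_case = (1 - incidence) * false_pos_rate
    noncase_called_noncase = (1 - incidence) * (1 - false_pos_rate)

    flagged_among_recorded_cases = true_case_recorded / (
        true_case_recorded + noncase_called_case
    )
    unflagged_among_recorded_noncases = noncase_called_noncase / (
        noncase_called_noncase + true_case_missed
    )
    return flagged_among_recorded_cases, unflagged_among_recorded_noncases

# Hypothetical rates: 2% true incidence, 10% of true cases undiagnosed,
# 1% of true noncases wrongly recorded as cases.
share_cases, share_noncases = apparent_accuracy(0.02, 0.10, 0.01)
print(f"recorded cases that were predicted:      {share_cases:.3f}")
print(f"recorded noncases that were not flagged: {share_noncases:.3f}")
```

Because true noncases greatly outnumber true cases at low incidence, even a small rate of false attribution can noticeably dilute the recorded case group.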



The independent variables, the so-called risk factors, have similar difficulties. None of them can be measured with certainty. All of them constitute indices of processes which are highly complex. For example, blood pressure is notoriously variable and responsive to a variety of internal and external stimuli. Total cholesterol is now known to include one component (in the low density lipoproteins) positively associated with CHD incidence and another (in the high density lipoproteins) negatively associated with CHD incidence. The metabolic processes determining the levels of each are poorly understood but apparently quite complicated [11, 12]. It is likely that these processes are in fact deeply implicated in atherogenesis. The measurement of total cholesterol indexes this only in a highly abbreviated manner. It would therefore be naive to anticipate that the prediction of CHD is likely to approach perfection as the disease itself and the usual risk factors are currently defined and measured.




The conditional probability of CHD in 2 yr given a set of risk characteristics is represented in this report by a logistic regression function in which the parameters are estimated by the method of Walker and Duncan [13]. Strictly speaking, this is a data summarization, not prediction. Its relation to actual prediction will be discussed briefly later. The illustrations used come from the experience accumulated during the first 11 biennial examinations of the Framingham Study cohort [14] and are restricted to men 45-54 yr old at the beginning of each observation period. The risk factors considered are serum cholesterol, systolic blood pressure, cigarette smoking (yes, no), glucose intolerance (yes, no), heart enlargement on X-ray (definite or not), ECG-LVH (definite or not) and urine albumin (definite or not). Age and sex, which are also risk factors, are held substantially constant by restricting consideration to men 45-54. This roster of characteristics has only one thing in common: in persons free of CHD they are all associated with the risk that CHD will develop in the next 2 yr. Serum cholesterol, blood pressure and cigarette smoking are alterable and are not manifestations of CHD. Glucose intolerance can be treated, but there is little reason to believe that the present therapies lower the associated CHD risk. Heart enlargement on X-ray or ECG-LVH is, in part, a manifestation of elevated blood pressure and in part a consequence of ischemic damage to the myocardium. Urine albumin may indicate hypertensive cardiovascular problems but it is not a specific manifestation of CHD. Age is not reversible; sex is ordinarily a fixed characteristic. Only serum cholesterol, systolic blood pressure and age are entered in these calculations as continuous variables. The others are treated as discrete variables. Criteria and measurement techniques are described in previous reports [15].

Steepness of slope

There are several ad hoc statistics which may be used to measure the steepness of the risk gradient obtained by a logistic regression on a specified set of risk factors. For example, Menotti et al. [16] have proposed

$$M_r = \sum_{i=1}^{10} c_i (i - a)^r / N_c$$

as statistics, where $c_i$ is the number of cases observed in the $i$th decile, $a$ is a constant (perhaps zero) and $N_c$ is the total number of cases. Somewhat simpler is the relative risk statistic obtained by dividing the number of cases in the upper decile (or quintile) by the number of cases in the lowest decile (or quintile): $D_{10}/D_1$ or $Q_5/Q_1$, in the usual shorthand. The more this number exceeds 1.0 the steeper the gradient. If $D_1$ or $Q_1$ is zero this statistic is undefined, but that may be a warning that the total number of cases is too small for firm conclusions. A related statistic is the difference between the number of cases in the upper decile (or quintile) and in the lowest decile (or quintile), perhaps divided by the total number of cases. A ratio of zero implies that the risk factors were of no assistance in predicting CHD.

Various combinations of the risk factors are considered in Table 1 and evaluated in terms of the criteria previously specified. Suppose we begin with systolic blood pressure and serum cholesterol. The sample ratio $Q_5/Q_1$ is 2.60 for systolic blood pressure and 3.38 for serum cholesterol (Table 1). If we combine these two risk factors into a bivariate function the ratio $Q_5/Q_1$ is 6.50. After adding cigarette smoking the ratio becomes 6.12 (clearly the sampling variability is fairly large if adding a risk factor decreases the sample ratio). Adding all the other variables, the ratio becomes 8.67. There is a special interest in what goes on in the upper and lower ends of the risk scale.
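The ratio and difference statistics just described are simple enough to sketch directly. In the example below the quintile case counts are hypothetical and are not taken from Table 1; `steepness` is a helper name introduced here.

```python
import numpy as np

def steepness(cases_by_group):
    """Steepness of a risk gradient from case counts per risk group,
    ordered from lowest to highest estimated risk.

    Returns (ratio, scaled difference):
      ratio = cases in top group / cases in bottom group (nan if undefined)
      scaled difference = (top - bottom) / total number of cases
    """
    c = np.asarray(cases_by_group, dtype=float)
    top, bottom, total = c[-1], c[0], c.sum()
    # The ratio is undefined when the lowest group has no cases.
    ratio = top / bottom if bottom > 0 else float("nan")
    scaled_diff = (top - bottom) / total
    return ratio, scaled_diff

# Hypothetical quintile counts (lowest to highest risk).
ratio, scaled_diff = steepness([4, 7, 10, 15, 24])
print(f"Q5/Q1 = {ratio:.2f}, (Q5 - Q1)/N = {scaled_diff:.3f}")

# A flat gradient: the risk factors were of no assistance.
print(steepness([12, 12, 12, 12, 12]))
```

Returning `nan` rather than raising an error when the lowest group is empty mirrors the remark that an undefined ratio is itself a warning about small numbers.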
Considering only the estimates of the number of cases in the upper decile of risk, it is clear in this instance that once two significant risk factors are considered, the incremental gain in the number of cases appearing in the high risk group is quite small as additional risk factors are added. What does happen, however, is that more cases come to be shifted from the lower end of the predicted risk scale to the upper. This expresses itself in an improvement in correct identification of both low and high risk persons and a noticeable increase in the ratio $Q_5/Q_1$. However, after the first two variables (and almost any two variables of roughly equal strength would serve to illustrate the same point), adding additional variables increases the ratio rather slowly. This phenomenon is also frequently observed in multivariate linear regression.

GOODNESS OF FIT



If the regression of CHD incidence in a fixed time period conditional on some risk factor or factors is estimated, several criteria may be considered for determining how well the estimated regression fits the sample data. As Efron has observed, there is no formally best criterion when the response variable is dichotomous, but one good criterion is the likelihood ratio statistic [17]. This may be used to assess whether the data are fitted significantly better by taking the risk factors into account than by assuming everyone to have the same risk. Another is to order the population by the magnitude of the estimated risk, dividing it into (say) 10 groups along the risk gradient and then comparing the actual and expected number of cases in each decile of risk by means of the usual $\chi^2$ goodness of fit statistic. It is difficult to define the exact sampling characteristics of this statistic in this case beyond saying that a smaller number is better than a larger number. A third method for assessing goodness of fit is to calculate the average proportion of the variance of the probability of developing CHD accounted for by regression. This is a more complicated concept than may appear at first blush. 'Proportion of variance explained' is a statistical property of regression estimates that is particularly useful when dealing with data well described by a multivariate normal distribution. In the multivariate normal case, the usefulness of the conditional expectation of $y$ given $x_1, \ldots, x_p$ can be evaluated by $R^2$, the multiple correlation coefficient. It is conceivable that $R^2$, which is sometimes described as the proportion of variance of $y$ explained by $x_1, \ldots, x_p$, can attain 100%. In fact, instances where $R^2$ approaches 100% are actually encountered in practice. Even in the multivariate normal case the phrase 'proportion of variance explained' sometimes is taken to mean more than it says.
The proportion of variance explained, whether it is large or small, does not cast any light on whether the variables considered are explanatory in a mechanistic sense or whether there are other important factors involved. Even where $R^2$ is non-trivial it is entirely conceivable that a completely different set of factors could be found that had an equal 'explanatory' power to those under consideration. When the response variable is dichotomous, as in the case of CHD, additional interpretative problems arise. In this case $y$ takes on the values of zero or one (occurrence or non-occurrence of CHD in a specified period of followup). Furthermore, the conditional expectation of $y$ given $x_1, \ldots, x_p$, say $p(x)$, is a non-linear function of $x_1, \ldots, x_p$. In such a context it is theoretically conceivable that the average proportion of variance of $y$ explained by $x_1, \ldots, x_p$ can reach 100%, but this can occur only in certain degenerate situations. Two such situations are cited in the appendix, the more interesting being the case where for some specified constellation of values of the risk characteristics a person is certain to develop CHD, i.e. $p(x_0)$ is one, and for all other constellations a person is certain not to develop CHD. We believe that such cases are of no interest in the real world. Rather we believe that the distributions of $p$ (the CHD risk) ordinarily encountered are likely to be similar to the distribution of conditional probabilities from the seven variable function described in Table 1: a right-skewed unimodal continuous distribution with a very long tail. Unfortunately this assumption is not subject to direct verification, since only estimated conditional probabilities can ever be known and altering the set of independent variables tends to alter the distribution of estimated conditional probabilities. The assumption of right-skewness implies that the mean of this







APPENDIX

Proportion of variance explained when responses are dichotomous

To clarify ideas, it is useful to make an analogy with multivariate normal regression theory. In multivariate normal theory, one assumes random sampling of, say, $(y, x_1, \ldots, x_p)$ from a multivariate normal population with mean vector $(\mu_0, \mu')$ and variance-covariance matrix $\Sigma$, say. It is then well known that the conditional expectation of $y$ given $x_1, \ldots, x_p$ is

$$E(y \mid x_1, \ldots, x_p) = \beta_0 + \sum_{i=1}^{p} \beta_i x_i \tag{1}$$

where $\beta_0$ is a function of $(\mu_0, \mu')$ and the elements of $\Sigma$ while the $\beta_i$ are functions only of the elements of $\Sigma$. Furthermore, the conditional distribution of $y$ is univariate normal with mean $E(y \mid x_1, \ldots, x_p)$ and variance $\sigma_y^2 (1 - R^2)$, where $\sigma_y^2$ is the unconditional variance of $y$ and $R^2$ is the multiple correlation coefficient between $y$ and $(x_1, \ldots, x_p)$. $R^2$ is frequently referred to as the proportion of variance (of $y$) explained by $x_1, \ldots, x_p$. More generally we may define the average proportion of variance explained (APVE) by

$$\mathrm{APVE} = \frac{\text{unconditional variance of } y \; - \; \text{average conditional variance of } y}{\text{unconditional variance of } y} \tag{2}$$

In multivariate normal theory, the modifier 'average' is not necessary since the conditional variance of $y$ is constant. In the present context, a basic assumption is still that $(y, x_1, x_2, \ldots, x_p)$ is a random sample from some distribution. In addition, it is assumed that $y$ takes on the values zero or one and that

$$E(y \mid x_1, x_2, \ldots, x_p) = E(y \mid x) = p(x) \tag{3}$$

where $p(x)$ may be a non-linear function of $x_1, x_2, \ldots, x_p$ such as the logistic function. Moreover, the conditional variance of $y$ is

$$E\{[y - p(x)]^2 \mid x\} = p(x)q(x), \tag{4}$$

where $q(x) = 1 - p(x)$. The unconditional mean of $y$ is $E\,p(x) = \bar p$, say, while the unconditional variance is

$$E(y - \bar p)^2 = E\{[y - p(x)] + [p(x) - \bar p]\}^2 = E[y - p(x)]^2 + E[p(x) - \bar p]^2 = E\,p(x)q(x) + E\,p^2(x) - \bar p^2 = \bar p \bar q, \tag{5}$$

where $\bar q = 1 - \bar p$. Thus the average proportion of the variance of $y$ explained by $x_1, x_2, \ldots, x_p$ is given by

$$\frac{\bar p \bar q - E\,p(x)q(x)}{\bar p \bar q}. \tag{6}$$
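The average proportion of variance explained can be computed directly from any set of conditional probabilities. The sketch below assumes equal weights unless others are supplied; the five probabilities are hypothetical, and `apve` is a name introduced here. It also checks numerically that the quantity equals the variance of $p(x)$ divided by $\bar p \bar q$, which is what the variance decomposition implies.

```python
import numpy as np

def apve(p, weights=None):
    """Average proportion of variance explained for a zero-one response,
    given the conditional probabilities p(x) over the population."""
    p = np.asarray(p, dtype=float)
    if weights is None:
        w = np.full(p.shape, 1.0 / p.size)
    else:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
    p_bar = np.sum(w * p)
    total_var = p_bar * (1.0 - p_bar)        # unconditional variance of y
    avg_cond_var = np.sum(w * p * (1.0 - p)) # average of p(x) q(x)
    return (total_var - avg_cond_var) / total_var

# Hypothetical conditional probabilities for five equally sized risk groups.
p = np.array([0.01, 0.02, 0.05, 0.10, 0.30])
p_bar = p.mean()
print(f"APVE = {apve(p):.4f}")
print(f"Var(p) / (p_bar * q_bar) = {p.var() / (p_bar * (1 - p_bar)):.4f}")

# Limiting cases: no spread in risk, and risk that is always zero or one.
print(apve([0.1, 0.1, 0.1]), apve([0.0, 0.0, 0.0, 1.0]))
```

The two limiting cases return zero and one respectively, which matches the degenerate situations discussed in the text.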



It seems unlikely from (6) that all of the variance can be explained. There are situations, however, when the average conditional variance $E\,p(x)q(x)$ in (6) is zero. One of these is when, even though there is a non-degenerate distribution of $x$, $p(x)$ is either zero or one, indifferently of $x$. The only other situation of this kind appears to be when $p(x)$ is one for $x = x_0$ and $p(x) = 0$ otherwise. Neither of the extreme situations seems likely to occur in the real world. In particular, with respect to coronary heart disease outcomes, one expects interval rates to be on the average quite modest, perhaps in the range of 1-15%. Further, one anticipates relatively few individuals with extremely high interval risks. These comments suggest that we anticipate a right-skewed distribution on the interval (0, 1), with a very long tail. It may be reasonable as well to anticipate that the distribution of risk is uni-modal. These expectations can and should be checked against data in any particular case.

In any event, it is of interest to ask to what extent the average proportion of variance explained is limited by assumptions of the type described above. It is desirable to investigate this issue for fairly general probability densities of the variable $p$, but first we will be more specific and assume the probability density of $p$ is a Beta function density given by

$$f(p) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha - 1} (1 - p)^{\beta - 1}, \qquad 0 < p < 1, \tag{7}$$

where $\alpha, \beta > 0$; the density is uni-modal when $\alpha$ and $\beta$ both exceed 1. Under (7) the mean of $p$ is $\bar p = \alpha/(\alpha + \beta)$ and its variance is $\alpha\beta/[(\alpha + \beta)^2(\alpha + \beta + 1)]$, so that, writing $\bar q = 1 - \bar p$, the average proportion of variance explained (6) is

$$\frac{1}{\alpha + \beta + 1}. \tag{9}$$

For $\alpha, \beta > 1$, it has already been noted that (7) is uni-modal. The modal value of $p$, $p_m$ say, is easily obtained by maximizing (7) with respect to $p$. One finds

$$p_m = \frac{\alpha - 1}{\alpha + \beta - 2}. \tag{10}$$

Transforming from $\alpha$ and $\beta$ to $\bar p$ and $p_m$, we calculate (9) to become

$$\frac{\bar p - p_m}{1 - 3p_m + \bar p} \tag{11}$$

while the derivative of (11) with respect to $p_m$ is

$$\frac{-(1 - 2\bar p)}{(1 - 3p_m + \bar p)^2}. \tag{12}$$

From (12), it follows that for given $\bar p < \tfrac{1}{2}$, the maximum average proportion of variance explained occurs for $p_m = 0$; i.e. it is given by

$$\frac{\bar p}{1 + \bar p}. \tag{13}$$

On the other hand, if $\bar p > \tfrac{1}{2}$, it follows that the maximum average proportion of variance explained occurs for $p_m = 1$ and is given by

$$\frac{1 - \bar p}{2 - \bar p}. \tag{14}$$

Finally, if $\bar p = \tfrac{1}{2}$, the average proportion of variance explained is $\tfrac{1}{3}$ independently of $p_m$ (as may be computed directly from (11), (13), or (14) with $\bar p = \tfrac{1}{2}$) and is the maximum average proportion of variance explainable for any distribution of $p$ well described by a Beta distribution with parameters $\alpha$, $\beta$ both greater than 1.

Equation (11) is, of course, of interest in itself. It seems at least reasonable that a Beta distribution such as described is a good approximation to many uni-modal continuous (or discrete) distributions of $p$. But it is unlikely that the result assuming a Beta distribution of $p$ will hold in every instance, especially for discrete distributions involving very few values. As an illustration, suppose $p$ assumes values $p_0$ and $p_1$ ($0 \le p_0 < p_1 \le 1$) with probabilities $\pi$ and $(1 - \pi)$ respectively and that $\pi p_0 + (1 - \pi) p_1 = \bar p$. Elementary calculations show that, for this case,

$$\mathrm{APVE} = \frac{(p_1 - \bar p)(\bar p - p_0)}{\bar p \bar q} \tag{15}$$

and that, furthermore, the maximum for $p_0 \le \bar p \le p_1$ is simply

$$\left(\sqrt{p_1 q_0} - \sqrt{p_0 q_1}\right)^2, \tag{16}$$

where $q_i = 1 - p_i$. If one supposes $p_0 = 0$, APVE is, at most, $p_1$; if $p_1 = 1$, APVE is at most $q_0$. Thus, in this extreme case, APVE can, in principle, be quite large. It would be of interest to calculate max APVE for a discrete unimodal distribution of $p$ with $k$ ($> 3$) classes, but we have been unable to do this.

We can, however, as implied earlier, get results similar to (11), (13) and (14) under the sole assumption that $p$ is distributed according to a continuous density $f(p)$ on $0 \le p \le 1$, and that $f(p)$ is unimodal. To derive these results we first write (6) as

$$\frac{1}{\bar p \bar q} \int_0^1 (p - \bar p)^2 f(p)\, dp \tag{17}$$

and notice we can write (17) as

$$\frac{1}{\bar p \bar q} \left[ \int_0^1 (p - p_m)^2 f(p)\, dp - (\bar p - p_m)^2 \right] \tag{18}$$

where $p_m$ is the value of $p$ for which $f(p)$ attains its single mode. Concentrating only on the integral in (18), we can write it as

$$\int_0^{p_m} (p_m - p)^2 f(p)\, dp + \int_{p_m}^1 (p - p_m)^2 f(p)\, dp. \tag{19}$$

Now we call on the following

Theorem. Let $X$ be a random variable with probability density $g(X)$, $0 \le X \le 1$, and let $h_1(X) \ge 0$, [$h_2(X) \ge 0$] be strictly monotone decreasing [strictly monotone increasing] with finite expectations. Then

$$E\,h_1(X)h_2(X) < E\,h_1(X)\,E\,h_2(X). \tag{20}$$

This theorem is an easy generalization of a theorem appearing in Kimball [28]. Considering the two integrals in (19) separately, we first observe that in $(0, p_m)$, $(p - p_m)^2$ is monotone decreasing and non-negative while $f(p)$ is non-negative and monotone increasing. Taking $g(p)$ to be the uniform density on $(0, p_m)$, we have by the theorem

$$\frac{1}{p_m} \int_0^{p_m} (p - p_m)^2 f(p)\, dp < \frac{p_m^2}{3} \cdot \frac{F(p_m)}{p_m}$$

or

$$\int_0^{p_m} (p - p_m)^2 f(p)\, dp < \frac{p_m^2}{3} F(p_m) \tag{21}$$

where $F(p_m)$ is the cumulative distribution of $p$, evaluated at $p = p_m$. By a similar argument

$$\int_{p_m}^1 (p - p_m)^2 f(p)\, dp < \frac{(1 - p_m)^2}{3} \left[1 - F(p_m)\right]. \tag{22}$$

One can then write an upper bound for (18) as

$$\frac{1}{\bar p \bar q} \left[ \frac{p_m^2}{3} F(p_m) + \frac{(1 - p_m)^2}{3} \left[1 - F(p_m)\right] - (\bar p - p_m)^2 \right]. \tag{23}$$
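The two integral inequalities and the resulting upper bound can be checked numerically for a particular unimodal density. The sketch below assumes a Beta density with hypothetical parameters 2 and 5 and uses a hand-written trapezoidal rule; the helper names are introduced here, and the code is a numerical illustration, not a proof.

```python
import math
import numpy as np

def beta_pdf(p, a, b):
    """Beta density; a and b play the roles of alpha and beta."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * p ** (a - 1) * (1 - p) ** (b - 1)

def trapezoid(y, x):
    """Plain trapezoidal rule on a grid x with ordinates y."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

a, b = 2.0, 5.0                 # hypothetical parameters, both > 1
p_bar = a / (a + b)             # mean
p_m = (a - 1) / (a + b - 2)     # mode

lo = np.linspace(0.0, p_m, 20001)
hi = np.linspace(p_m, 1.0, 20001)
F_m = trapezoid(beta_pdf(lo, a, b), lo)                      # F(p_m)
lower = trapezoid((lo - p_m) ** 2 * beta_pdf(lo, a, b), lo)  # integral below the mode
upper = trapezoid((hi - p_m) ** 2 * beta_pdf(hi, a, b), hi)  # integral above the mode

lower_bound = p_m ** 2 / 3 * F_m
upper_bound = (1 - p_m) ** 2 / 3 * (1 - F_m)

pq = p_bar * (1 - p_bar)
apve_exact = (lower + upper - (p_bar - p_m) ** 2) / pq
bound = (lower_bound + upper_bound - (p_bar - p_m) ** 2) / pq

print(f"below mode: {lower:.5f} < {lower_bound:.5f}")
print(f"above mode: {upper:.5f} < {upper_bound:.5f}")
print(f"APVE = {apve_exact:.4f}, 1/(a+b+1) = {1 / (a + b + 1):.4f}, upper bound = {bound:.4f}")
```

For this density the bound is far from tight, which is to be expected since it must hold for every unimodal density sharing the same mode and the same value of the cumulative distribution at the mode.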








The bound (23) does not take into account that we are implicitly fixing the mean, $\bar p$. To take this into account, we observe that the bounding distribution implied by the application of (20) is a mixture of two uniform distributions, one with weight $F(p_m)$ on $(0, p_m)$, the other with weight $1 - F(p_m)$ on $(p_m, 1)$. The requirement that the mean is fixed at $\bar p$ thus leads to

$$F(p_m) = 1 - 2\bar p + p_m. \tag{24}$$

An analysis of (24) based on the requirement $0 \le F(p_m) \le 1$ leads to the requirements:

(a) if $\bar p \le \tfrac{1}{2}$, $0 \le p_m \le 2\bar p$,
(b) if $\bar p > \tfrac{1}{2}$, $2\bar p - 1 \le p_m \le 1$.

Substitution of (24) into (23) leads after some simplification to

$$\frac{\left(\tfrac{2}{3} - \bar p\right) - \dfrac{p_m}{3\bar p}(1 - 2\bar p)}{1 - \bar p} \tag{25}$$

as a bound for APVE. We observe that (25) is monotone decreasing in $p_m$ for $\bar p < \tfrac{1}{2}$, monotone increasing in $p_m$ for $\bar p > \tfrac{1}{2}$ and takes the value 1/3 for $\bar p = \tfrac{1}{2}$. It follows immediately, if $\bar p < \tfrac{1}{2}$, that the maximum of (25) for variation in $p_m$ is given by

$$\frac{\tfrac{2}{3} - \bar p}{1 - \bar p} \tag{26a}$$

while for $\bar p \ge \tfrac{1}{2}$, (25) assumes a maximum at $p_m = 1$. A little re-arrangement shows this maximum is given by

$$\frac{\tfrac{2}{3} - (1 - \bar p)}{1 - (1 - \bar p)}. \tag{26b}$$

It is also of interest to note that for $\bar p < \tfrac{1}{2}$, the minimum of (25) [the min-max] is at $p_m = 2\bar p$ and is

$$\frac{\bar p}{3(1 - \bar p)}. \tag{27a}$$

Similarly, for $\bar p > \tfrac{1}{2}$, the min-max occurs when $p_m = 2\bar p - 1$ and is given by

$$\frac{1 - \bar p}{3\bar p}. \tag{27b}$$

We observe that, for $\bar p < \tfrac{1}{2}$, (27a), the min-max, will be less than (13). This indicates the importance of knowledge of $p_m$.
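To see how these expressions compare at the modest interval rates discussed earlier, they can be tabulated side by side. The sketch below assumes the Beta-based result takes the form $(\bar p - p_m)/(1 - 3p_m + \bar p)$ and that the bound for a general unimodal density takes the form coded in `unimodal_bound`; both forms are reconstructions from the surrounding derivation, and the mean risks in the loop are hypothetical.

```python
def beta_apve(p_bar, p_m):
    """APVE when p follows a Beta density with mean p_bar and mode p_m."""
    return (p_bar - p_m) / (1 - 3 * p_m + p_bar)

def unimodal_bound(p_bar, p_m):
    """Bound on APVE for a unimodal continuous density with mean p_bar and
    mode p_m, after fixing the mean of the bounding mixture."""
    return ((2 / 3 - p_bar) - p_m * (1 - 2 * p_bar) / (3 * p_bar)) / (1 - p_bar)

print(" p_bar  Beta max  bound max  min-max")
for p_bar in (0.02, 0.05, 0.10, 0.15):   # hypothetical mean interval risks
    beta_max = beta_apve(p_bar, 0.0)          # mode at zero
    bound_max = unimodal_bound(p_bar, 0.0)    # mode at zero
    min_max = unimodal_bound(p_bar, 2 * p_bar)
    print(f"{p_bar:6.2f}  {beta_max:8.4f}  {bound_max:9.4f}  {min_max:7.4f}")
```

At a mean risk of one-half both functions return one-third whatever the mode, in agreement with the text; at low mean risks the Beta maximum is close to the mean itself, while the general bound with the mode at zero approaches two-thirds.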


