J Chron Dis Vol. 31, pp. 353-355. Pergamon Press Ltd. 1978. Printed in Great Britain
0021-9681/78/0501-0353$02.00/0

A NOTE ON THE CALCULATION OF MULTIVARIATE RISK FUNCTIONS

PETER MCCARTNEY

London School of Hygiene and Tropical Medicine, Keppel Street, London WC1, England

(Received 12 November 1976)

INTRODUCTION

A common problem in the analysis of prospective screening surveys is the estimation of a multivariate risk function for follow-up events. This risk function is usually based on the logistic model and its parameters can be estimated in two ways: (1) via the discriminant function and ordinary least squares [1] and (2) by maximum likelihood via iterative weighted least squares [2]. Recently Goldbourt et al. [3] have published some data in which the discriminant function estimates “failed to achieve any degree of fit and thus could not be considered for use”. This is unusual with data of this size and type. To quote Halperin et al. [4], “empirically, the maximum likelihood method usually gives slightly better fits to the model as evaluated for observed and expected numbers of cases per decile of risk” (my italics). For example, if we look at, say, Table 3 of [4] it is immediately clear that it would be impossible to make any judgement about the discriminant function estimates without seeing the maximum likelihood fit for comparison. This would apply to any fit, no matter how bad or good. On closer examination an explanation of the discrepancy seems possible.

METHODS

Consider a survey in which variables $(x_1, \ldots, x_p) = x$ are measured for each individual. The study comprises two populations, $\Pi_1$ in which an event occurs during follow-up and $\Pi_0$ in which an event does not occur. If $P$ is the probability that an individual belongs to $\Pi_1$, then the multiple logistic risk function is of the form

$$P = \left[1 + \exp\left(-\alpha - \sum_i \beta_i x_i\right)\right]^{-1}$$

It has long been known [5] that, under the assumption that $x$ is multivariate normal with the same covariance matrix in $\Pi_1$ and $\Pi_0$, the multiple logistic function is identical to Fisher’s discriminant function (up to the constant $\alpha$). Furthermore, the above assumption can be weakened to the condition that $\sum \beta_i x_i$ be univariate normal with the same variance in both populations [1]. Fisher [6] showed that the estimates of the parameters of his discriminant function ($b_d$) were (exactly) proportional to the coefficients ($b_r$) calculated by regression of a dummy population variable on the independent variables $x$ and he gave the value of the scaling factor in [7]. If the dummy variable is 1 for members of $\Pi_1$ and 0 for members of $\Pi_0$, it is quite easy to show using [8] and [9] that

$$b_d = \frac{(n_0 + n_1)(n_0 + n_1 - 2)}{n_0 n_1 P_{res}}\, b_r \qquad (1)$$

where $n_0$ and $n_1$ are the population sizes and $P_{res}$ is the proportion of the sum of squares unexplained by regression. A possibly more useful form of (1) when utilising the output from a typical regression package is

$$b_d = \frac{n_0 + n_1 - 2}{\text{Residual sum of squares}}\, b_r \qquad (2)$$
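The proportionality in (1) and (2) can be checked numerically. The following is a minimal sketch in plain Python, not part of the original note: the simulated data, the sample sizes and the helper names (`solve`, `mean_vector`, `sscp`) are all assumptions made for illustration. It computes Fisher's discriminant coefficients directly from the pooled within-group covariance matrix and compares them with the rescaled dummy-variable regression coefficients.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def sscp(rows, centre):
    """Sums of squares and cross-products of rows about `centre`."""
    p = len(centre)
    dev = [[v - c for v, c in zip(row, centre)] for row in rows]
    return [[sum(d[i] * d[j] for d in dev) for j in range(p)] for i in range(p)]


# Hypothetical data: two predictors, group 1 shifted relative to group 0.
random.seed(0)
n0, n1 = 400, 100
group0 = [[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(n0)]
group1 = [[random.gauss(0.8, 1.0), random.gauss(0.5, 1.0)] for _ in range(n1)]

# Fisher's discriminant: pooled within-group covariance (divisor n0 + n1 - 2)
# applied to the difference between the group means.
m0, m1 = mean_vector(group0), mean_vector(group1)
diff = [a - b for a, b in zip(m1, m0)]
w0, w1 = sscp(group0, m0), sscp(group1, m1)
pooled = [[(u + v) / (n0 + n1 - 2) for u, v in zip(r0, r1)]
          for r0, r1 in zip(w0, w1)]
b_d = solve(pooled, diff)

# Least squares regression of the 0/1 dummy variable on x (centred form).
rows = group0 + group1
y = [0.0] * n0 + [1.0] * n1
mx, my = mean_vector(rows), n1 / (n0 + n1)
xy = [sum((row[i] - mx[i]) * (yi - my) for row, yi in zip(rows, y))
      for i in range(2)]
b_r = solve(sscp(rows, mx), xy)

total_ss = sum((yi - my) ** 2 for yi in y)  # equals n0 * n1 / (n0 + n1)
rss = total_ss - sum(b * c for b, c in zip(b_r, xy))
p_res = rss / total_ss

scale_1 = (n0 + n1) * (n0 + n1 - 2) / (n0 * n1 * p_res)  # equation (1)
scale_2 = (n0 + n1 - 2) / rss                            # equation (2)

# The three vectors printed below should agree up to rounding error.
print(b_d)
print([scale_1 * b for b in b_r])
print([scale_2 * b for b in b_r])
```

The relation is algebraic, so in this sketch the agreement does not depend on the simulated data being normal; normality matters only for interpreting the discriminant function as a logistic risk function.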


In calculating the scaling factor, Goldbourt et al. (pp. 236-7) made several erroneous assumptions which led to the equation

$$b_d = \frac{(n_0 + n_1)(n_0 + n_1 - 1)}{n_0 n_1}\, b_r \qquad (3)$$
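As a short worked step (added here for illustration, not taken from the original text), dividing the scaling factor in (3) by the scaling factor in (1) gives

$$\frac{(n_0 + n_1)(n_0 + n_1 - 1)/(n_0 n_1)}{(n_0 + n_1)(n_0 + n_1 - 2)/(n_0 n_1 P_{res})} = \frac{n_0 + n_1 - 1}{n_0 + n_1 - 2}\, P_{res} \approx P_{res}$$

for samples of any reasonable size. On this reading, (3) understates the scaling factor by a factor of roughly $P_{res}$, the proportion of the sum of squares unexplained by the regression.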

Comparing this to (1) we can see that (3) gives a biased estimate of the scaling factor, the bias increasing as $P_{res}$ decreases and being negligible only when there is little discrimination.

PERFORMANCE

The performance measure used by Goldbourt et al. was “goodness of fit” of the observed and expected numbers of events per decile of risk. Although this is not an undesirable property and in their case proved useful in detecting biased estimation, goodness of fit is not a necessary and sufficient condition for the least squares estimates to be good estimates of the true parameters. It would be feasible for a poor discriminator to give a good fit and vice versa. In addition, if the maximum likelihood fit has to be checked every time to validate the discriminant function estimates, then the discriminant function estimates become redundant. A more useful check on the validity of the analysis would be to plot the distributions of $\beta'x$ for each population. If these showed two normal distributions with similar variances then the underlying assumptions are satisfied, implying that the maximum likelihood estimates would produce little improvement in discrimination. It should be pointed out, however, that this is no guarantee that the individual parameter estimates will be similar. Examining the distribution of the risk function suggests a possible measure of its discriminatory power. Consider the top decile of risk for the data of Goldbourt et al. We can form a table:

Risk function    MI cases    Non MI    Totals
Top 10%                78       866       944
Bottom 90%            166      8326      8492
Totals                244      9192      9436

Then the sensitivity of the top decile is 78/244 = 0.32, i.e. using the risk function 32% of the MI cases were located in 10% of the population, and the specificity is 8326/9192 = 0.91. For good discrimination we require high sensitivity and specificity. Youden [10] has suggested that we combine these measures to form a new index J = sensitivity + specificity - 1.

So for these data J = 0.23. (It should be noted here that for low rates such as these the specificity is unlikely to stray far from 0.90 and it may be sufficient merely to consider the sensitivity.) The choice of this form of performance measure was somewhat arbitrary. It does have two useful properties: (1) in the case of the discriminant function it can be applied without calculating the scaling factor and (2) most published analyses present the data in a form which allows the calculation of J; thus it is possible to compare the performance of a new analysis with previously published results. Two other papers which deal with multivariate analysis of rates have been published by the Israel Ischemic Heart Disease Study. Although the rates in these two papers refer to different events and a different set of predictor variables was used, it is interesting to examine their performance.

Reference                 J
Kahn et al. [11]          0.19
Medalie et al. [12]       0.14
Goldbourt et al. [3]      0.23
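The value of J for Goldbourt et al. can be reproduced from the top-decile table given earlier. The sketch below is illustrative only: the function name `youden_index` and its argument names are assumptions, and the counts are those quoted in the text (78 and 166 MI cases, 866 and 8326 non-cases in the top 10% and bottom 90% respectively).

```python
def youden_index(true_pos, false_neg, false_pos, true_neg):
    """Return (sensitivity, specificity, J) for a two-by-two table."""
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity, sensitivity + specificity - 1


# "Positive" here means falling in the top decile of the risk function.
sens, spec, j = youden_index(true_pos=78, false_neg=166,
                             false_pos=866, true_neg=8326)
print(round(sens, 2), round(spec, 2), round(j, 2))  # prints: 0.32 0.91 0.23
```

Only the ranking of individuals by risk enters the calculation, which is why J can be obtained for the discriminant function without the scaling factor.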


The analysis of Goldbourt et al. demonstrated the highest level of discrimination (and hence bias) and was the only analysis where the fit of the discriminant function estimates differed from the maximum likelihood estimates. It should be pointed out that there is no implied criticism of the analysis once the maximum likelihood method has been chosen.

SUMMARY

In the multivariate analysis of rates, the relationship between the discriminant risk function and the maximum likelihood logistic function is discussed with reference to a recently published analysis by Goldbourt et al. [3] which found the former to be inferior. The calculation of the discriminant risk function is shown to be in error and the performance assessment method (goodness of fit) is questioned. What appeared to be one more nail in the coffin of calculation of multivariate risk functions by non-iterative techniques is at least partially withdrawn. An additional measure of the performance of a risk function, which is often used to rate diagnostic tests, is suggested.

REFERENCES

1. Truett J, Cornfield J, Kannel W: A multivariate analysis of the risk of coronary heart disease in Framingham. J Chron Dis 20: 511-524, 1967
2. Walker SH, Duncan DB: Estimation of the probability of an event as a function of several independent variables. Biometrika 54: 167-179, 1967
3. Goldbourt U, Medalie JH, Neufeld HN: Clinical myocardial infarction over a five-year period. 3. A multivariate analysis of incidence. The Israel ischemic heart disease study. J Chron Dis 28: 217-237, 1975
4. Halperin J, Blackwelder WC, Verter JT: Estimation of the multivariate logistic function: a comparison of the discriminant function and maximum likelihood approaches. J Chron Dis 24: 125-158, 1971
5. Welch BL: Note on discriminant functions. Biometrika 31: 218-220, 1939
6. Fisher RA: The use of multiple measurements in taxonomic problems. Ann Eugen Lond 7: 179-188, 1936
7. Fisher RA: The statistical utilization of multiple measurements. Ann Eugen Lond 8: 376-386, 1938
8. Healy MJR: Computing a discriminant function from within sample dispersions. Biometrics 21: 1011-1012, 1965 (Note two misprints in the use of constants k and D,.)
9. Anderson TW: Introduction to Multivariate Statistical Analysis. Wiley, 140-141, 1958
10. Youden WJ: Index for rating diagnostic tests. Cancer 3: 32-35, 1950
11. Kahn HA, Herman JB, Medalie JH et al.: Factors related to diabetes incidence: a multivariate analysis of two years’ observation on 10,000 men. J Chron Dis 23: 617-629, 1971
12. Medalie JH, Papier CM, Goldbourt U, Herman JB: Major factors in the development of diabetes mellitus in 10,000 men. Archs Int Med 135: 811-817, 1975
