International Journal of Epidemiology © Oxford University Press 1977

Vol. 6, No. 4 Printed in Great Britain

Paths of Association in Epidemiological Analysis: Application to Health Effects of Environmental Exposures*

JOHN R GOLDSMITH1

Two criteria for selection of paths of association to be discussed are temporal or spatial plausibility, and inferences from partial regression or correlation. In conventional stepwise multiple regression, independent variables, say x and y, are assumed to have a symmetrical relationship with each other, x ↔ y, and each is assumed to have an asymmetrical relationship with the dependent or health status variable, z. The stepwise procedure chooses which independent variable is associated with the greater variance in the joint co-variance of x and y with z.

INTRODUCTION

The purpose of this paper is to suggest improvements in the procedures and logic of multivariate analyses commonly used in epidemiology. Such methods are based on certain assumptions concerning the paths of association among the presumptively independent variables, and hence the pattern and mechanism by which they may affect the dependent variable which is usually a morbidity or mortality rate. Some of these conventional assumptions are logically improbable. One type of possible associative path is usually not considered. In this paper, examples of the unlikely assumptions, and the paths of association not customarily tested will be given along with a discussion of the expected consequences of an alternative mode of analysis. This approach, derived from 'path analysis', will be considered in relationship to these assumptions.

The application of these principles to systems with four or more independent variables will be our major concern. This paper puts forward some reasons why such a multivariate set should be considered to have some pairs of independent variables in asymmetrical rather than symmetrical relationships. The entire set of relationships among variables is referred to as a structure, of which the structure shown for x, y, and z above is an elementary example. More elaborate structures are found in our earlier paper (1).

1 Medical Epidemiologist, Epidemiological Studies Laboratory, California State Department of Health, 2151 Berkeley Way, Berkeley, California 94704, USA.
* Based in part on a paper given at the Seventh International Scientific Meeting of the International Epidemiological Association, University of Sussex, August 1974.

Downloaded from http://ije.oxfordjournals.org/ at University of Otago on July 11, 2015

Goldsmith, J R (Epidemiological Studies Laboratory, California State Department of Health, 2151 Berkeley Way, Berkeley, California 94704, USA). Paths of association in epidemiological analysis: application to health effects of environmental exposures. International Journal of Epidemiology 1977, 6: 391-399. Conventional regression analysis is based on assumptions of bidirectional associations between pairs of independent variables. In a number of circumstances these assumptions are not plausible. Structural representation in conventional regression is based on a set of parallel paths between independent and dependent variables; when the implausible assumptions are excluded, a different structural relation between independent and dependent variables is found. It permits a series associative path between independent variables. Two criteria for modification of conventional multivariate analysis are presented. They apply when bilateral symmetry among independent variables is implausible on the basis of a priori information, and when there are significant differences between zero order and first order partial correlation coefficients. When these criteria are applied, there may result a series-parallel matrix of associations. For analysis of such a matrix, the procedures of path analysis are appropriate. The concepts are illustrated with environmental examples, and path analytical computations are worked out for a set of data on social and environmental factors which affected infant mortality in England and Wales between 1928-1938.

T → P    P → T*
T → M    M → T*
P → M    M → P*

Of these, the three starred are unlikely; high pollution doesn't cause a heat or cold wave, and neither does mortality cause pollution or temperature extremes. Temporal sequencing provides a natural polarity of associations. For example, if the high pollution levels preceded the heat or cold wave by several days, it is difficult to accept the T → P causal path. However, we might postulate a weather system (W) change as producing both the temperature change and the high pollution. Without the postulate we could write

T → P    T → M    P → M

and with it we could write

W → T    W → P    T → M    P → M

representing the array of plausible bivariate paths. Noting that the temperature change is a possible antecedent factor for both pollution increase and mortality increase, and that changes in temperature or pollution might be possible antecedent factors to increase mortality we could write:

And with the assumption that weather system changes are the appropriate antecedent of both the heat or cold wave and the pollution increase, analogously we could write:

FIG. 1 Structural relationships likely to describe associations of Temperature (T), Pollutants (P) and Weather variables (W) with Mortality (M).

Both of these diagrams represent structural ordering based on common sense exclusions. Both are based solely on asymmetrical relationships among independent variables. The customary assumptions of symmetrical relationships in multivariate analysis would represent all the associations among independent variables as bidirectional. (Figure 2.)

FIG. 2 Structural relationships assumed in customary multivariate analysis of associations of Temperature (T), Pollutants (P) and Weather variables (W) with Mortality (M). Unaccounted for variables are represented by (a).
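The common-sense exclusions described for Temperature (T), Pollution (P) and Mortality (M) can be sketched in a few lines of Python. This is only an illustration: the `RANK` dictionary is a hypothetical way of encoding the temporal polarity discussed in the text (temperature before pollution before mortality), and the function name `plausible_paths` is not from the paper.

```python
from itertools import permutations

# Variables named in the text.
VARIABLES = ["T", "P", "M"]

# Hypothetical encoding of temporal/causal polarity: a directed path is
# kept only when it runs from a lower rank to a higher rank.
RANK = {"T": 0, "P": 1, "M": 2}


def plausible_paths(variables, rank):
    """Split all ordered pairs into plausible and excluded directed paths."""
    kept, excluded = [], []
    for cause, effect in permutations(variables, 2):
        if rank[cause] < rank[effect]:
            kept.append((cause, effect))
        else:
            excluded.append((cause, effect))
    return kept, excluded


kept, excluded = plausible_paths(VARIABLES, RANK)
print("plausible:", [f"{a} -> {b}" for a, b in kept])
print("excluded: ", [f"{a} -> {b}" for a, b in excluded])
```

Three variables give six ordered pairs; the ranking keeps three of them and excludes the three starred paths. Adding a weather variable W would only require a further entry in `RANK`, under whatever ordering the investigator is prepared to defend.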

As shall be demonstrated, the alternative of asymmetrical rather than symmetrical relationships among independent variables has consequences in ordering of variables in a stepwise process and in estimation of variance 'explained'.

Structural Relationships Based on Temporal or Spatial Plausibility

Structural relations can be inferred from certain common sense considerations. For example, certain relationships have a polarity which requires us to assume that an association, say between a heat or cold wave (T) and an increase in daily mortality (M) of a big city, goes in only one of two possible directions. (The increase in mortality did not cause the heat or cold wave!) If a heat wave is also associated with high levels of photochemical oxidant (P) as well as increased mortality (as is likely to occur in Southern California) or a cold wave is associated with sulphurous pollution as well as increased mortality (as is likely to occur in the UK) we would have three variables and six possible bivariate asymmetrical path relationships.

Inferences From Partial Regression and Correlation

By deleting the variables one at a time, we may test for the statistical significance of the contribution of one among a set of variables to the variance of M. We may then observe the alteration in the contribution to variation of the remaining variables. We could then state as a criterion that: In general, if adding a structurally plausible variable A with coefficient Xa to a regression equation with one or more independent variables N, with significantly non-zero coefficients Xn, causes the regression coefficient of one or more of the N variables, B with coefficient Xb, to become statistically indistinguishable from zero, then it is likely that the path of the effect of B on the dependent variable passes through A. Analogous criteria may be made for the addition of variables which make important changes in either direction for regression coefficients. In such a way does the common-sense structural relationship suggest statistical tests.

A relevant example was the apparent significance of pollutant variables in the multivariate analysis of mortality in metropolitan areas of the US by Lave and Seskin (2). When to the other variables were added variables representing the type of fuel used in household heating, the significance of pollution variables was lost. Type of fuel used in household heating was a structurally plausible variable and by this criterion, we could say that the path for the effect of pollution passes through the variable reflecting fuel used in household heating. By common-sense considerations it is the type of household heating which affects pollution, not the reverse.

In highly multivariate situations, the application of multiple regression analysis, assuming as it does a symmetrical relationship among independent variables, may fail to correctly partition the extent of dependence among variables. Some concepts and procedures of path analysis, along with certain relationships of partial correlation, do permit a more versatile, dynamic, and numerically interesting approach which will be illustrated. The application of partial correlation will be continued after discussion of path analysis.

Path Analysis

The phrase 'path analysis' was introduced by the geneticist Sewall Wright (3) for the study of the interrelationship and persistence of variously linked heterozygotic attributes after successive brother-sister mating in an inbreeding experiment.

By path analysis is meant a type of linear regression analysis applied to an assumed structure by which independent and dependent variables may be related in a closed system. The assumed structure is called a path diagram which may also be considered a statement of an hypothesis regarding a pattern of associations. The magnitude of the individual paths of association in the pattern is assumed to be testable by empirical data using the procedures of analysis. As with other hypothetical statements, different investigators provided with the same data may propose different path diagrams.

Tukey (4) in 1954 drew a sharp distinction between analysis based on regression coefficients, which he favoured, and correlation coefficients, which he did not favour. Li (5) is currently applying path analysis to statistical genetics and his recent book is recommended as an introduction.

Many recent applications of path analysis have been in sociology. Blalock (6) has been a leader in this effort. Its place in sociology, for example, is reflected in the prominence given to path analysis in the monograph on 'Sociological Methodology 1969' (7); the first three articles (and nearly half the book) involve path analysis in theory and application. For one seeking a starting point in principles of path analysis applied to sociological variables the articles by Land (8), and Heise (9), and the book on structural equations by Duncan (10) are recommended. The subject is also extensively treated in Sociological Methodology 1971 (11).

One result of path analysis is a set of path coefficients which are defined as: 'a path coefficient $P_{ij}$ is a number which measures the fraction of the standard deviation of the dependent variable (i) for which the designated variable (j) is directly responsible, in the sense that j varies to the same extent that it does by observation, under the condition that all associated variables are held constant'.

When using standardized variables (mean = 0, standard deviation = 1) this is estimated by the partial regression (or correlation) coefficient, holding constant all variables other than i and j. When the variances are thus made equal the path coefficient and the path regression coefficient are the same.

Another criterion for inferring path relationships which seems useful is based on analogy to the criterion introduced above, only using partial correlation coefficients rather than addition or subtraction of variables in a regression equation: If in a matrix of correlation coefficients representing a plausible array of presumably independent variables N, the difference between $r_{ya}$ (correlation

of y, the dependent variable, with a, one of the N independent variables) and $r_{ya.b}$ (partial correlation of y on a holding b constant) is large then it is likely that the path of a on y involves b. For testing the significance of differences among correlation coefficients we use the Z transformation, $Z = \tfrac{1}{2}\ln\frac{1+r}{1-r}$.

The standard deviation of Z corresponding to a given r is $\sigma_z = 1/\sqrt{N-m-1}$, where N is the number of independent observations, m is the number of variables specified (two in the case of zero order correlation, more in partial correlations). Another way then of stating this criterion in testable form is: In a structurally plausible set of correlations of variables j . . . n of which variables k . . . n are considered independent, and j is considered dependent, a large absolute value of the z-transformed difference $r_{jk} - r_{jk.m}$ relative to $\sigma_z$ suggests that the path from j to k passes through variable m.

Thus, in comparing $|r_{ya} - r_{ya.b}|$ to the expected variance of $r_y$, based on z transformation, if the difference is great, then the series path A → B → Y is relatively important compared to the direct path A → Y. It will follow that the former structural relationship could be expressed as in Figure 3.

FIG. 3 Examples of a series-associative path by which variable A affects Y through its influence on variable B. Such a path of association is usually part of a multivariate structure.

Comparably we can test whether the path relationship of B to Y may involve A. The differences in magnitude of these two differences, $r_{ya} - r_{ya.b}$ and $r_{yb} - r_{yb.a}$, are criteria for asymmetrical relationships of a and b as independent variables in association with dependence of y. As can be shown, $r_{ya.b}$ is not equal to $r_{yb.a}$. There is a sequential or serial dependence of A through B on Y in Figure 3, a type of association which is not currently being considered due to conventional use of multivariate statistical tests.

The Basic Algebraic Relationships Among Correlation and Path Coefficients

Parallel paths sum the path coefficients, while series paths multiply path coefficients. Path coefficients can be estimated from an appropriate set of estimating equations. For example, the diagram

[path diagram]

leads to estimating equations:

[estimating equations, including $r_{tp} = P_{tp}$]

in which a is the symbol of all other variables which influence y. Here $r_{mn}$ stands for the correlation and $P_{mn}$ for the path coefficient when m and n are connected or have associative paths. Note that $r_{mn} = r_{nm}$ but that $P_{mn} \neq P_{nm}$. A 'path diagram' denotes a set of working assumptions concerning the pattern of associations of variables in a multivariate set. A 'path model' refers to the equation (also called a structural equation) expressing a relationship among standardized variables and the paths linking them in the path diagram. The 'path estimating equations' represent the basis for computing path coefficients. The path regression coefficients, $c_{ij}$, equal $P_{ij}\,\sigma_i/\sigma_j$. We shall be computing the path coefficients, $P_{ij}$, from which path regression coefficients $c_{ij}$ may be computed on the basis of this relationship.
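The rule that parallel paths sum while series paths multiply can be checked numerically. The sketch below assumes the three-variable diagram of Figure 3 with an added direct path (A → B → Y together with A → Y) and standardized variables; the estimating equations in the docstring are the usual path-tracing equations for that diagram and are an assumption here, since the paper's own example equations did not survive extraction. The three correlations are taken from Table I (Y = infant mortality, A = unemployment, B = housing); treating unemployment as acting through housing is purely illustrative.

```python
def path_coefficients(r_ya, r_yb, r_ab):
    """Solve assumed estimating equations for A -> B -> Y plus A -> Y.

    With standardized variables:
        r_ab = P_ba
        r_ya = P_ya + P_yb * r_ab
        r_yb = P_yb + P_ya * r_ab
    """
    denom = 1.0 - r_ab ** 2
    p_ba = r_ab
    p_ya = (r_ya - r_yb * r_ab) / denom
    p_yb = (r_yb - r_ya * r_ab) / denom
    return p_ba, p_ya, p_yb


# Zero-order correlations from Table I (illustrative assignment of roles).
r_ya, r_yb, r_ab = 0.4208, 0.5784, 0.5236

p_ba, p_ya, p_yb = path_coefficients(r_ya, r_yb, r_ab)
direct = p_ya            # single path A -> Y
series = p_ba * p_yb     # series path A -> B -> Y (coefficients multiply)

print(f"direct path  A -> Y      : {direct:.4f}")
print(f"series path  A -> B -> Y : {series:.4f}")
print(f"sum of parallel routes   : {direct + series:.4f}")
print(f"zero-order r_ya          : {r_ya:.4f}")
```

Under these assumptions the direct and series contributions add back to the zero-order correlation, and the coefficient attached to A falls from $r_{ya}$ to $P_{ya}$ once B enters, which is the behaviour the first criterion looks for.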

Comparing Series-Parallel With Conventional Paths of Association

Note that merely excluding implausible symmetry, as shown in comparing Figures 1 and 2, has the effect of converting parallel matrices of association to a parallel-series matrix of associations among independent variables and the dependent one. Conventional multiple regression inherently implies a series of models of parallel paths of association between independent variables and the dependent one as shown in Figure 2. On the basis of the first criterion, the effect of adding one variable on the contribution of another suggests a different form of path association, which we can designate as a series associative path. A similar inference may be drawn from the application of the second criterion based on partial correlation coefficients. The same inference could be drawn from criteria of temporal or spatial plausibility.

The differences between combined series-parallel associative paths and pure parallel associative paths are numerically, strategically, and logically important.

A numerical example will be given later. Strategically, the objective of the epidemiological effort is to demonstrate an association which may be used to prevent disease or improve health. It is usually not feasible to intervene effectively in more than one independent variable at a time, and even if it were possible that multiple variables could be dealt with simultaneously, we should want to know from available data which was likely to be the most effective intervention. Accordingly, if we were able to choose which variable to approach first in trying to deal constructively with a multivariate problem, epidemiologists would have a great strategic advantage. Often stepwise regression is used as the basis for this choice, based on the conventional assumptions of purely parallel relationships between all independent variables and the dependent variable. The important independent variable(s) to approach first may appear to be in a different sequence for tests of association in a series-parallel matrix than would be so for the more conventional parallel matrix.

Finally, there is a logical weakness in failing to use knowledge about temporal or spatial sequences or other common sense relationships, when the information is available. We have no reason to discard the application of available information when we come to analyse a complex problem. It is more attractive to use all knowledge, both based on measurement as well as a priori information about absence of associative paths or relative importance of possible paths.

This experience applied to epidemiology leads to a set of descriptive, statistical, and algebraic procedures, suitable for epidemiological data involving multivariate analysis.

Application to Infant Mortality in England and Wales, 1928-1938: Numerical Example

Data collected by Woolf and Waterhouse (12) illustrate the results of such an analysis, using social and environmental variables for county boroughs for England and Wales for 1928-1938.

The county boroughs of England and Wales include about a third of the population, and with the exception of London, all the big towns. All but three had over 50,000 inhabitants, and Birmingham, the largest, had over one million. Over the period of study, the infant mortality varied from 28 to 129 per thousand (mean, 66.7; standard deviation = 17.05). In studying social variables associated with infant mortality, the authors sought '... to obtain an estimate, as precise as available data allow, of what reduction of infant mortality is possible by specific improvements in social conditions'. The authors used the method of multiple regression.

The following variables were chosen, on a basis which is fully discussed in their report.

H an index of crowded housing which is the percentage of families with, in the 1931 Census, more than one person per room, having a mean of 28.2 per cent, and a range of 14.1 to 52.2 per cent.
U an index of unemployment, which was the percentage of males unemployed, according to the Ministry of Labour's Local Unemployment Index averaged over the year prior to that in which infant mortality was incurred.
P an index of poverty, which was the percentage of occupied males in semi-skilled and unskilled occupations according to the 1931 Census.
F an index of the proportion of women employed in manufacturing based on data from the 1931 Census.
L latitude, which previously had been shown to have an important association with infant mortality.

Of these variables, only the unemployment and infant mortality varied from year to year and if for each of the eleven years and 82 county boroughs a


mortality regression were calculated, it would yield 902 equations. Weighting each county borough by its population did not increase the proportion of explained variance $R^2$. The computations to follow are based on the correlation matrix shown in Table I. There is some uncertainty about the number N to use in the expression $1/\sqrt{N-m-1}$ to compute the variance of Z. With 83 county boroughs and 11 years, one might think it would be 82 x 10. Since, in fact, only two variables did indeed vary by year, we have elected to adopt N = 82 x 2. This gives a $\sigma_z$ of approximately 0.08 for Z transformed zero order correlations.

Computational Approach

To carry out the path analysis strategy, we now compute the matrix of first order partial correlation coefficients which is shown in Table II. These coefficients have the form $r_{ij.k}$ where j is the columned variable, k is the row variable, and i represents the dependent variable, infant mortality. We interpret a large difference compared to $\sigma_z$ between $r_{ij.k}$ and $r_{ij}$ as indicating that k

TABLE I

Zero order correlations affecting infant mortality by County Borough, England and Wales, 1928-1938

Variable               1 M      2 U      3 H      4 P      5 F      6 L
1 Infant Mortality     1.0000
2 Unemployment         0.4208   1.0000
3 Housing              0.5784   0.5236   1.0000
4 Poverty              0.5253   0.4461   0.6870   1.0000
5 Employed Women       0.2624   0.0849   0.0217   0.0768   1.0000
6 Latitude             0.5239   0.3950   0.5288   0.5007   0.2867   1.0000

TABLE II

First order partial correlation coefficients of infant mortality with column variables holding row variables constant, County Boroughs, England and Wales, 1928-1938

                                  Correlation of mortality with
Variable held constant            U            H            P
2 Unemployment                    -            0.4633*      0.4158*
3 Housing                         0.1697***    -            0.2158***
4 Poverty                         0.2448**     0.3518**     -
5 Employed Women                  0.4609       0.6055       0.5250
6 Latitude                        0.2733*      0.4169**     0.3567**
Standardized Regression
Coefficient                       0.1592       0.2389       0.1317

* Differs from zero order correlation by 1σ-2σ. ** Differs from zero order correlation by 2
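The partial-correlation criterion can be tried on the tabulated values. The sketch below uses the standard formula for a first order partial correlation, the Z transformation (computed here with `math.atanh`, which equals ½ ln((1+r)/(1−r))), and the paper's choice of N = 82 x 2 with m = 2 for zero order correlations. The function names and the dictionary layout are illustrative assumptions, and expressing the Z difference in units of the zero order $\sigma_z$ is one reading of the criterion rather than a documented procedure.

```python
import math

# Zero-order correlations copied from Table I (M = infant mortality,
# U = unemployment, H = housing, P = poverty).
R = {
    ("M", "U"): 0.4208, ("M", "H"): 0.5784, ("M", "P"): 0.5253,
    ("U", "H"): 0.5236, ("U", "P"): 0.4461, ("H", "P"): 0.6870,
}


def r(a, b):
    """Look up a zero-order correlation regardless of argument order."""
    return R[(a, b)] if (a, b) in R else R[(b, a)]


def partial_r(y, a, b):
    """First-order partial correlation of y with a, holding b constant."""
    num = r(y, a) - r(y, b) * r(a, b)
    den = math.sqrt((1 - r(y, b) ** 2) * (1 - r(a, b) ** 2))
    return num / den


def sigma_z(n, m):
    """Standard deviation of Z as given in the text: 1 / sqrt(N - m - 1)."""
    return 1.0 / math.sqrt(n - m - 1)


N = 82 * 2                 # the paper's adopted number of observations
SZ = sigma_z(N, 2)         # zero-order case, roughly 0.08

for a, b in [("U", "H"), ("P", "H"), ("H", "U"), ("H", "P")]:
    r0, rp = r("M", a), partial_r("M", a, b)
    shift = (math.atanh(r0) - math.atanh(rp)) / SZ
    print(f"r(M,{a}) = {r0:.4f}   r(M,{a}.{b}) = {rp:.4f}   "
          f"Z shift = {shift:.1f} sigma")
```

Holding housing constant reduces the correlation of mortality with unemployment far more than holding unemployment constant reduces the correlation with housing; this asymmetry is what the criterion interprets as a path from unemployment to mortality passing through housing.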
