Drug and Alcohol Dependence, 26 (1990) 35-38
Elsevier Scientific Publishers Ireland Ltd.


A note on the use of statistical models in epidemiologic research on illicit drug use

James C. Anthony

Department of Mental Hygiene, School of Hygiene and Public Health, The Johns Hopkins University, Baltimore, MD 21205 (U.S.A.)

(Received November 13th, 1989)

This essay aims to stimulate thinking or to remind readers about the shortcomings of standardized regression coefficients and related statistical measures in epidemiologic research on illicit drug use. This is accomplished primarily with a set of examples based on simulated epidemiologic data in which the standardized regression coefficient is shown to co-vary dramatically with frequency of the outcome variable. The basic thrust of this critique of commonly used regression models is not new; it has appeared elsewhere several times. Nevertheless, in epidemiologic research on illicit drug use, there is a continuing use of standardized regression coefficients and other margin-sensitive statistical measures without comment on their shortcomings. Thus, a specific critique with illustrations might have value.

Key words: drug abuse; epidemiology; risk factors

0376-8716/90/$03.50 © 1990 Elsevier Scientific Publishers Ireland Ltd. Printed and Published in Ireland

Introduction

In epidemiologic research, distributions of variables often are dichotomous, with one value characteristic of most subjects and the other value characteristic of few. A typical example is seen in a proportion or a rate: for example, in a recent study in which psychiatrists made diagnoses based on standardized examinations, we found that among adult household residents of eastern Baltimore, a total of 2.2% qualified for a standardized diagnosis of drug abuse-dependence syndromes [1].

Within the public health branch of epidemiology, there is a preference for stratified analyses of such data, and a rather heavy reliance upon the odds ratio as a statistical measure of the strength of association between suspected explanatory variables and occurrence of health-related events (such as illicit drug use). Consistent with this emphasis, there is frequent use of the logistic regression model, which provides estimates of the odds ratio and, thereby, identification of factors associated with occurrence of disturbed behavior or health [2].

Some observers have pointed out that epidemiologists might rely too heavily upon the odds ratio and the logistic regression model [3]. Nevertheless, there is a broad consensus that alternative models and statistical measures often fail to serve well in the epidemiologic context. Standardized regression coefficients, correlations, and path coefficients are notorious for their distortion of biologic effects in epidemiologic research, largely because they are computationally dependent upon the marginal distribution of the outcome variable, the independent variables, or both [4].

Epidemiologists and biostatisticians hold no monopoly over this concern about the use of standardized regression coefficients and other related techniques that have demonstrable utility in other contexts. Sociologists and others have written about this problem in relation to sociological, psychological and educational research [5-8]. Some practitioners have drawn attention to the problem in their research reports [9].

Broad variation in occurrence of illicit drug use

The problem can be made concrete by observing that drug researchers often are interested in modeling the occurrence of illicit drug use (either prevalence or incidence) as a function of suspected explanatory variables. If the problem is framed in relation to prevalence (e.g., the probability of being a currently active illicit drug user), then frequently observed mean prevalence values range from less than 1% (e.g., for heroin) to about 10% (e.g., for marijuana), though in certain subgroups the prevalence of active marijuana use might exceed 25% [10].

If the problem is framed in relation to incidence (e.g., the probability of becoming a new case during some defined span of population experience), then the observed mean incidence values typically will have a much lower range. For example, annual incidence estimates for drug abuse and/or dependence during the early 1980s were recently reported [11]. The largest reported estimate, for males 18-29 years old, was 0.044%/year.

Proportions and rates such as these can vary considerably from one population group to another, from city to city, or from region to region. For example, among 12- to 17-year-old Americans surveyed in 1985, the estimated prevalence of recent cocaine use ranged from 0.5% in the South to 2.0% in the West. Among 18- to 25-year-olds, the values ranged from 3.0% in the South to 12.9% in the West [10]. Of clear interest is what might underlie this type of variation. What is needed is an approach that will not be obscured by underlying variation in the proportions.

Simulated data

The following table of simulated data demonstrates how group-to-group or region-to-region variation of this magnitude can undercut the apparent value of regression models. Specifically, the table illustrates how a difference or change in the marginal distribution of a dichotomous dependent variable (e.g., current cocaine user: yes or no) can have an important effect on standardized regression coefficients (Beta) and related statistical measures. As will be seen, this often is not the case for coefficients from logistic regression.

In these simulations, the number of observations is 50 000, large enough that issues of statistical significance can be disregarded. The dependent variable, A, is dichotomous, and in the table, PR is the probability of a positive response on A at the lowest value of a single independent variable X (PR = probability of A = 1, given X = 1).

In the simulated examples shown in the first three rows of Table I, X also is dichotomous. It takes on values of 1 and 2 with equal frequency (i.e., 25 000:25 000), as might be true in the study of gender as a factor associated with risk of cocaine use. In the simulated examples shown in the last three rows of Table I, X takes on the five values from 1 to 5, each occurring with equal frequency (e.g., quintiles of the X distribution). In these models, X is treated as a quantitative variable, which is consistent with the simulation described next and which might be appropriate when studying an index of socioeconomic status in relation to prevalence of cocaine use.

For each example, the hypothetical data have been simulated so that the probability of A = 1 is doubled for each unit increase in the value of X. That is, the simulated strength of association between A and X is a constant (2.0). Thus, in the first row of Table I, PR = 0.005 for X = 1. Consistent with the doubling rule, PR = 0.010 for X = 2. In the fourth row, PR = 0.005 for X = 1, and so PR = 0.010 for X = 2, PR = 0.020 for X = 3, PR = 0.040 for X = 4, and PR = 0.080 for X = 5.
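
The simulation design described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's SAS job stream: it works from the expected proportions implied by the doubling rule rather than from individual records, and the function names (`doubling_probs`, `ols_summary`, `odds_ratio_binary`, `logistic_slope`) are hypothetical. The least squares quantities are computed from population moments, and the logistic slope is fitted by a plain Newton-Raphson iteration; the results can be compared with the corresponding columns of Table I.

```python
import math

def doubling_probs(pr, levels):
    """P(A = 1 | X = x) for x = 1..levels, doubling with each unit increase in X."""
    return [pr * 2 ** (x - 1) for x in range(1, levels + 1)]

def ols_summary(probs):
    """Least squares regression of A on X when every X level is equally frequent.
    Returns (B, Beta, R-squared), computed from population moments."""
    k = len(probs)
    xs = list(range(1, k + 1))
    mean_x = sum(xs) / k
    mean_a = sum(probs) / k
    var_x = sum((x - mean_x) ** 2 for x in xs) / k
    var_a = mean_a * (1 - mean_a)            # variance of a 0/1 variable
    cov = sum((x - mean_x) * p for x, p in zip(xs, probs)) / k
    b = cov / var_x                          # unstandardized slope
    beta = b * math.sqrt(var_x / var_a)      # standardized slope
    return b, beta, beta ** 2                # one predictor: R-squared = Beta squared

def odds_ratio_binary(probs):
    """Odds ratio for X = 2 versus X = 1 when X is dichotomous."""
    p1, p2 = probs
    return (p2 / (1 - p2)) / (p1 / (1 - p1))

def logistic_slope(probs, iterations=50):
    """Newton-Raphson maximum likelihood fit of logit P(A = 1) = a + b * X
    to the expected proportions (equal group sizes). Returns b = ln(OR)."""
    xs = list(range(1, len(probs) + 1))
    a = b = 0.0
    for _ in range(iterations):
        fitted = [1 / (1 + math.exp(-(a + b * x))) for x in xs]
        g0 = sum(p - f for p, f in zip(probs, fitted))               # score, intercept
        g1 = sum(x * (p - f) for x, p, f in zip(xs, probs, fitted))  # score, slope
        w = [f * (1 - f) for f in fitted]
        h00 = sum(w)
        h01 = sum(x * wi for x, wi in zip(xs, w))
        h11 = sum(x * x * wi for x, wi in zip(xs, w))
        det = h00 * h11 - h01 ** 2
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return b

if __name__ == "__main__":
    for levels in (2, 5):
        for pr in (0.005, 0.010, 0.050):
            probs = doubling_probs(pr, levels)
            b, beta, r2 = ols_summary(probs)
            slope = logistic_slope(probs)
            print(f"PR={pr:.3f} X=1..{levels}  ln(OR)={slope:.4f} "
                  f"OR={math.exp(slope):.2f}  B={b:.3f} Beta={beta:.3f} R2={r2:.4f}")
```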
This doubling rule corresponds to a situation in which the risk of drug use doubles for every unit increase in X; that is, the simulated relative risk value was two.

Table I. Simulated data to allow a comparison of results from logistic regression and ordinary least squares regression under varying conditions (N = 50 000).

                      Logistic regression                O.L.S. regression
PR*      Range of X   B = ln(OR)   OR      P             B       P           Beta    R-squared
0.005    1 to 2       0.6982       2.01    0.0001        0.005   0.0001      0.029   0.0008
0.010    1 to 2       0.7033       2.02    < 0.0001      0.010   0.0001      0.041   0.0017
0.050    1 to 2       0.7472       2.11    < 0.0001      0.050   0.0001      0.095   0.0090
0.005    1 to 5       0.7182       2.05    < 0.0001      0.018   0.0001      0.147   0.0216
0.010    1 to 5       0.7456       2.10    < 0.0001      0.036   < 0.0001    0.211   0.0446
0.050    1 to 5       1.1293       3.09    < 0.0001      0.180   < 0.0001    0.550   0.3029

*PR = Probability of A = 1, given X = 1. OR = Estimated odds ratio. B = Unstandardized regression coefficient. A technical appendix including the SAS job stream and log from the simulation is available upon request to the author.

In all but one of the examples shown in Table I, the regression coefficient under the logistic model was consistent with the simulated doubling in risk for each unit increase in X. The antilogarithm of the coefficient is the odds ratio, which often serves well as an estimate of relative risk. In the exception, the antilogarithm of the coefficient was 3.09, not too distant from the simulated relative risk value of 2.0, but different enough to demonstrate how the odds ratio estimate of relative risk can fail when the overall probability of a positive outcome (A = 1) is large (e.g., when mean prevalence is large). In this example, where PR = 0.05 when X = 1, the overall mean probability of A = 1 was 15 500/50 000, a value greater than 0.30.

Table I also shows the unstandardized and standardized coefficients (B and Beta) and R-squared estimates from ordinary least squares regression, which often have been used to study suspected risk factor relationships involving occurrence of drug use. Although the hypothetical data were simulated so as to keep relative risk constant, these regression results change with changing probability of a positive outcome (the remaining variables in these regression models are held constant). This sensitivity to the marginal distribution of the outcome follows directly from their calculation as statistical measures. Unlike the odds ratio, which is margin-insensitive, these statistical measures are margin-dependent (for additional details, see Refs. 12 and 4).

Of course, within any given dataset, Beta might allow for a comparison between variables (e.g., in relation to relative strengths of relationships). Nevertheless, in most instances the analyst seeks prevalence results that are not dataset-specific. Hence, this allowance has limited value.

Implications and conclusion

This essay uses simulated data to draw attention to a previously voiced critique of standardized regression coefficients and related statistical measures. These statistical measures now are reported in epidemiologic research on illicit drug use, without note of their shortcomings. To some extent, these shortcomings can be overcome with use of the logistic regression model and its estimates for the odds ratio and relative risk. Nevertheless, the simulation pointed out that this alternative regression model does not always serve well, and there are other reasons to be cautious in use of the odds ratio [3].

One key difficulty with the standardized regression coefficient and other correlation-based measures is margin-sensitivity. The standardized regression coefficient, by definition, varies in relation to the marginal distribution of the outcome variable (and also the corresponding independent variable). Thus, when an investigator reports this coefficient based on data from a population with a moderate-to-high occurrence of illicit drug use, it is inherently and unnecessarily tied to that level of occurrence. As shown in the simulation examples, readers can be fairly confident that the value of the estimated coefficient will not apply in other populations where illicit drug use occurs less frequently. Given the variation in occurrence of illicit drug use observed in recent surveys, this undercuts the potential significance of the reported work.

In conclusion, this essay may provide drug researchers with an additional reminder that statistical models found useful in other contexts will not always serve in the epidemiologic context. The foundation for epidemiologic analysis of rates and proportions is careful exploration of univariable distributions and bivariable relationships [13,14]. This foundation provides for informed selection of statistical models and, ultimately, it strengthens the basis for informed judgments about the models' capacity to reveal important effects without distortion.
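
The margin-sensitivity discussed above can be written out for the simplest simulated case, a dichotomous X taking the values 1 and 2 with equal frequency. The following is a sketch only; the notation (p_1, p_2, \bar{p}, s_X, s_A, RR) is introduced here for illustration and is an assumption, not the author's.

```latex
% Assumed notation: p_1 = P(A = 1 | X = 1) (PR in Table I), p_2 = RR * p_1,
% \bar{p} = overall P(A = 1), s_X and s_A = standard deviations of X and A.
\begin{align*}
B &= p_2 - p_1 = (RR - 1)\,p_1, \\
\bar{p} &= \tfrac{1}{2}(1 + RR)\,p_1, \qquad
s_A = \sqrt{\bar{p}\,(1 - \bar{p})}, \qquad
s_X = \tfrac{1}{2}, \\
\mathrm{Beta} &= B\,\frac{s_X}{s_A}
  = \frac{(RR - 1)\,p_1}{2\sqrt{\bar{p}\,(1 - \bar{p})}}, \\
\mathrm{OR} &= \frac{p_2/(1 - p_2)}{p_1/(1 - p_1)}
  = RR\,\frac{1 - p_1}{1 - RR\,p_1}.
\end{align*}
```

With RR fixed at 2, Beta grows roughly in proportion to the square root of p_1, whereas OR stays close to RR as long as p_1 is small. Setting p_1 = 0.005 gives Beta of about 0.029 and OR of about 2.01.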

Acknowledgements

Supported in part by grants from the National Institute on Drug Abuse (DA03992, DA04392). The author thanks Dr. William Eaton for helpful comments and Ms. Jill Schreiber for editorial assistance.

References

1 Anthony, J.C. et al. (1985) Arch. Gen. Psychiatr., 42, 667.
2 Rothman, K.J. (1986) Modern Epidemiology. Little, Brown and Company, Boston MA.
3 Feinstein, A.R. (1985) Clinical Epidemiology: The Architecture of Clinical Research. W.B. Saunders Company, Philadelphia PA.
4 Greenland, S., Schlesselman, J.J. and Criqui, M.H. (1985) Am. J. Epidemiol., 123, 203.
5 Blalock, H.M. (1960) Social Statistics, Second Edition. McGraw-Hill Book Company, N.Y.
6 Cleary, P.D. and Angel, R. (1984) J. Health Soc. Behav., 25, 334.
7 Maddala, G.S. (1983) Limited-Dependent and Qualitative Variables in Econometrics. Cambridge University Press, Cambridge.
8 Muthen, B. (1988) LISCOMP: Analysis of Linear Structural Equations with a Comprehensive Measurement Model. Scientific Software Inc., Mooresville IN.
9 Kaplan, H.B. and Martin, S.S. (1984) J. Health Soc. Behav., 25, 270.
10 United States. (1988) Department of Health and Human Services, NIDA DHHS Publication No. (ADM) 88-1586.
11 Eaton, W.W. et al. (1989) Acta Psychiatr. Scand., 79, 163.
12 Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W. (1975) Discrete Multivariate Analysis: Theory and Practice. The M.I.T. Press, Cambridge MA.
13 Fleiss, J.L. (1981) Statistical Methods for Rates and Proportions, Second edn. John Wiley and Company, NY.
14 Tukey, J.W. (1977) Exploratory Data Analysis. Addison-Wesley Publishing Co., Reading, MA.
