STATISTICALLY SPEAKING

Predicted Probabilities' Relationship to Inclusion Probabilities

It has been shown that under a general multiplicative intercept model for risk, case-control (retrospective) data can be analyzed by maximum likelihood as if they had arisen prospectively, up to an unknown multiplicative constant, which depends on the relative sampling fraction.1 With suitable auxiliary information, retrospective data can also be used to estimate response probabilities.2 In other words, predictive probabilities obtained without adjustments from retrospective data will likely be different from those obtained from prospective data. We highlighted this using binary data from Medicare to determine the probability of readmission into the hospital within 30 days of discharge, which is particularly timely because Medicare has begun penalizing hospitals for certain readmissions.3

Clinical epidemiological research typically uses prospective or retrospective data. Depending on the type of data, conflicting findings can occur. Using prospective data, Morse et al.4 were unable to support their previous findings from a retrospective study that tubal ligation and subsequent hysterectomy resulted in an increased risk of hydrosalpinx formation compared with tubal ligation alone. Without further inquiry, the earlier finding would have dissuaded patients from obtaining a subsequent hysterectomy. Although the pros and cons of using prospective or retrospective data are known, the fact that regression coefficients and predictive probabilities can be directly affected by the data type is rarely addressed. If data from a prospective or a retrospective study are analyzed using a logistic link, the interpretation of the regression coefficients does not differ. However, with other links such as the commonly used probit, or log-log (used when the probability of an event is very small or very large), the interpretation of the coefficients and the relative risk will differ. The predicted probabilities will always differ depending on whether the predictive model was fitted using prospective or retrospective data, and on the inclusion probabilities.

A prospective study is often conducted to determine whether there is an association between certain exposure factors and the occurrence (probability) of a particular event. Retrospective (or case-control) studies are good for studying rare conditions because they are relatively inexpensive, do not require a large sample size, and require less time. The relationship between exposure and occurrence can be investigated by fitting generalized linear models with a logit, probit, or complementary log-log link.

RETROSPECTIVE AND PROSPECTIVE COMPARISONS WITH DIFFERENT LINKS

Binary Models

To analyze binary data, we modeled the mean of the response, which takes the value 1 (event) or 0 (nonevent), as a function of a set of covariates with a set of regression coefficients to be estimated. This allowed the binary response data to be modeled through a transformation from the range (0, 1) to (–∞, ∞). Models include the logit (logistic transformation), probit (inverse transformation of the standard normal distribution), and complementary log-log (transforming with log twice).
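The three transformations can be written out directly. The following is a minimal sketch using only the Python standard library; the function names and the `LINKS` table are illustrative choices made here, not part of the original analysis.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def logit(p: float) -> float:
    """Log odds: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def probit(p: float) -> float:
    """Inverse of the standard normal cumulative distribution function."""
    return STD_NORMAL.inv_cdf(p)


def cloglog(p: float) -> float:
    """Complementary log-log: log(-log(1 - p)), the log applied twice."""
    return math.log(-math.log(1.0 - p))


def inv_logit(eta: float) -> float:
    """Map a linear predictor back to a probability under the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def inv_probit(eta: float) -> float:
    """Map a linear predictor back to a probability under the probit link."""
    return STD_NORMAL.cdf(eta)


def inv_cloglog(eta: float) -> float:
    """Map a linear predictor back to a probability under the cloglog link."""
    return 1.0 - math.exp(-math.exp(eta))


# Each entry pairs a link with its inverse.
LINKS = {
    "logit": (logit, inv_logit),
    "probit": (probit, inv_probit),
    "cloglog": (cloglog, inv_cloglog),
}

if __name__ == "__main__":
    for name, (link, inverse) in LINKS.items():
        for p in (0.05, 0.50, 0.95):
            eta = link(p)  # value on the unbounded linear-predictor scale
            print(f"{name:8s} p={p:.2f} -> eta={eta:+.4f} -> p={inverse(eta):.2f}")
```

The logit and probit links are symmetric about p = 0.5 (both map it to 0), whereas the complementary log-log link is not, which is one reason the links can behave differently when events are very rare or very common.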

May 2015, Vol 105, No. 5 | American Journal of Public Health

Definitions

Let Z = 1 if a unit is sampled and Z = 0 if it is not. Let $D$ denote that the event has occurred, and let $\bar{D}$ denote that it did not. Define $P(Z = 1 \mid D)$ and $P(Z = 1 \mid \bar{D})$ as the inclusion probabilities of the event and the nonevent, respectively, in the sample. The ratio of these inclusion probabilities is defined as k. Neither the inclusion probabilities nor their ratio depends on the covariates.

For a logit model, the regression coefficients (other than the intercept) are the same regardless of whether the data were analyzed as a prospective or a retrospective study. The odds ratios will therefore be equivalent, but the predicted probabilities will differ depending on the value of k, the ratio of the inclusion probability of the event to that of the nonevent.5 In a log-log model and a probit model, the regression coefficients for prospective and retrospective data are not equal, so neither the odds ratios nor the probabilities are equal.
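The logit result can be seen from Bayes' rule. The derivation below is a sketch under the stated assumption that the inclusion probabilities do not depend on the covariates x, with k taken as the event-to-nonevent inclusion ratio; the symbols α and β for the prospective intercept and slope are introduced here for illustration only.

```latex
\begin{align*}
P(D \mid Z = 1, x)
  &= \frac{P(Z = 1 \mid D)\, P(D \mid x)}
          {P(Z = 1 \mid D)\, P(D \mid x) + P(Z = 1 \mid \bar{D})\, P(\bar{D} \mid x)} \\[4pt]
\frac{P(D \mid Z = 1, x)}{P(\bar{D} \mid Z = 1, x)}
  &= k \, \frac{P(D \mid x)}{P(\bar{D} \mid x)},
  \qquad k = \frac{P(Z = 1 \mid D)}{P(Z = 1 \mid \bar{D})} \\[4pt]
\operatorname{logit} P(D \mid Z = 1, x)
  &= \log k + \operatorname{logit} P(D \mid x)
   = (\alpha + \log k) + \beta x
\end{align*}
```

Sampling on the outcome therefore multiplies the odds by k, which on the logit scale shifts only the intercept by log k and leaves β unchanged. The probit and complementary log-log links are not linear in the log odds, so the same shift is not absorbed into an intercept and the slope coefficients change as well.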

Fang et al. | Statistically Speaking | 837

[Figure 1 plot: curves for k = 0.33, 0.50, 1, 2, and 3 on axes labeled Prospective and Retrospective, each running from 0 to 1.0.]

FIGURE 1—Predictive probabilities based on prospective and retrospective data with varying inclusion ratio.

[Figure 2 plot: series for the logit, probit, and complementary log-log (cloglog) models; axis labeled Retrospective; tick marks run from 0.10 to 1.00.]

FIGURE 2—Medicare data example.


The predictive prospective probability will be greater than the predicted retrospective probability if 0 < k < 1, but equal if k = 1 and less if k > 1. The relationship between the retrospective and prospective probabilities is the same regardless of the link. In general, the predictive prospective probabilities $P^P$ are related to the retrospective probabilities $P^R$ through

$$
P^P = \frac{kP^R}{1 + kP^R - P^R} \tag{1}
$$

(Figure 1).
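The conversion between the two kinds of probability can be sketched in a few lines. This sketch derives both directions from Bayes' rule and assumes that k is the event-to-nonevent inclusion ratio, the orientation that reproduces the ordering stated above (prospective larger when 0 < k < 1, smaller when k > 1). Equation 1 as printed has the same functional form with k entering as its reciprocal, so the orientation of k is an assumption worth confirming before applying either expression. The function names are hypothetical.

```python
def retrospective_from_prospective(p_prosp: float, k: float) -> float:
    """Sampling on the outcome multiplies the odds of the event by k.

    k is assumed to be P(Z=1 | event) / P(Z=1 | nonevent).
    """
    return k * p_prosp / (1.0 + k * p_prosp - p_prosp)


def prospective_from_retrospective(p_retro: float, k: float) -> float:
    """Inverse map: divide the retrospective odds by k."""
    return p_retro / (k * (1.0 - p_retro) + p_retro)


if __name__ == "__main__":
    # Same retrospective probability, different inclusion ratios.
    for k in (0.33, 0.50, 1.0, 2.0, 3.0):
        p_prosp = prospective_from_retrospective(0.5, k)
        print(f"k={k:4.2f}  retrospective=0.50  prospective={p_prosp:.3f}")
```

Because the adjustment acts on the probability itself rather than on the linear predictor, the same two functions apply whichever link produced the retrospective prediction.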

MEDICARE DATA

We retrospectively sampled 1625 Medicare patients from the Arizona State Inpatient Database (2003–2005) and coded them 1 (n = 880) if they were readmitted (and had 4 readmissions in that period) into the hospital within 30 days of discharge and 0 (n = 745) if they were not. Selected covariates modeled included length of stay (LOS) and others. We fitted logit, probit, and complementary log-log models to these retrospective data using 17% as the 30-day readmission rate, an estimate for Arizona.6 The fitted logit, probit, and complementary log-log models are

$$
\begin{aligned}
\operatorname{logit}\{P(D \mid Z = 1, x)\} &= 0.0240 + 0.0265\,\mathrm{LOS} \\
\operatorname{probit}\{P(D \mid Z = 1, x)\} &= 0.0340 + 0.0129\,\mathrm{LOS} \\
\operatorname{cloglog}\{P(D \mid Z = 1, x)\} &= -0.3004 + 0.0094\,\mathrm{LOS}
\end{aligned}
\tag{2}
$$

LOS was significant in all models with P values of .006, .007, and .009. The predicted prospective probabilities obtained from the models in equation 2 and shown in Figure 2 were consistently smaller than predicted on the basis of retrospective modeled data.
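The adjustment for this example can be sketched as follows. The coefficients, sample counts, and 17% rate are taken from the text; everything else is an assumption made for illustration. In particular, the inclusion ratio k is estimated here as the sample odds of readmission divided by the population odds, which holds when sampling depends only on the outcome and may not be exactly how the published figure was produced. The LOS values in the usage loop are hypothetical.

```python
import math
from statistics import NormalDist

N_EVENTS, N_NONEVENTS = 880, 745  # sample counts reported in the text
POPULATION_RATE = 0.17            # 30-day readmission rate used in the text

# Fitted retrospective models: (intercept, LOS slope, inverse link).
MODELS = {
    "logit": (0.0240, 0.0265, lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "probit": (0.0340, 0.0129, NormalDist().cdf),
    "cloglog": (-0.3004, 0.0094, lambda eta: 1.0 - math.exp(-math.exp(eta))),
}


def inclusion_ratio(n_events: int, n_nonevents: int, population_rate: float) -> float:
    """Assumed estimate of k: sample odds divided by population odds."""
    sample_odds = n_events / n_nonevents
    population_odds = population_rate / (1.0 - population_rate)
    return sample_odds / population_odds


def prospective_from_retrospective(p_retro: float, k: float) -> float:
    """Divide the retrospective odds by k (k = event-to-nonevent inclusion ratio)."""
    return p_retro / (k * (1.0 - p_retro) + p_retro)


def predict(los: float) -> dict:
    """Return {model: (retrospective, prospective)} probabilities at a given LOS."""
    k = inclusion_ratio(N_EVENTS, N_NONEVENTS, POPULATION_RATE)
    result = {}
    for name, (intercept, slope, inverse_link) in MODELS.items():
        p_retro = inverse_link(intercept + slope * los)
        result[name] = (p_retro, prospective_from_retrospective(p_retro, k))
    return result


if __name__ == "__main__":
    for los in (1.0, 5.0, 10.0):  # hypothetical lengths of stay, in days
        for name, (p_retro, p_prosp) in predict(los).items():
            print(f"LOS={los:4.1f} {name:8s} retro={p_retro:.3f} prosp={p_prosp:.3f}")
```

Under this assumption k is well above 1 (roughly 5.8), because readmitted patients make up about 54% of the sample but only 17% of the population, so every adjusted probability falls below its retrospective counterpart.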

CONCLUSIONS AND RECOMMENDATIONS

Obtaining predicted probabilities after fitting a binary model is common in data analysis. Using retrospective data to derive predicted probabilities, however, may provide biased estimates. This issue is not usually addressed. We presented the relationship between predicted probabilities based on data treated prospectively or retrospectively using 3 common binary models (logit, probit, and complementary log-log). Without adequate inclusion criteria to ensure that the sample is representative of the population, the predicted probability based on retrospective data will be different from what it would be if the data were prospective. Using Medicare data, we showed that if the event-to-nonevent ratio in the sample is greater than that in the population, the estimated predicted probabilities in retrospective studies will be larger than those from prospective studies. In other words, an apparent decrease in 30-day rehospitalizations observed going forward with prospective data may be a statistical artifact rather than an actual improvement.

Di Fang, BS, Jenny Chong, PhD, and Jeffrey R. Wilson, PhD

About the Authors

Di Fang is with the Morrison School of Agribusiness, Arizona State University, Mesa. Jenny Chong is with the Department of Neurology, College of Medicine, University of Arizona, Tucson. Jeffrey R. Wilson is with the Department of Economics, W. P. Carey School of Business, Arizona State University, Tempe. Correspondence should be sent to Di Fang, Morrison School of Agribusiness, 7001 East Williams Field Road, Mesa, AZ 85212 (e-mail: [email protected]). Reprints can be ordered at http://www.ajph.org by clicking the “Reprints” link. This article was accepted January 20, 2015. doi:10.2105/AJPH.2015.302592

Contributors

D. Fang, J. Chong, and J. R. Wilson contributed equally to this article. D. Fang prepared the article. J. R. Wilson supervised the data analysis and writing, and revised the article. J. Chong reviewed the article and provided improvements.

Acknowledgments

We thank Darlene Lopez and members of the American Public Health Association Section on Statistics for their useful insights in reviewing this article.

References

1. Weinberg CR, Wacholder S. Prospective analysis of case-control data under general multiplicative-intercept risk models. Biometrika. 1993;80(2):461–465.
2. Hsieh DA, Manski CF, McFadden D. Estimation of response probabilities from augmented retrospective observations. J Am Stat Assoc. 1985;80(391):651–662.
3. Rau J. Armed with bigger fines, Medicare to punish 2225 hospitals for excess readmissions. Available at: http://www.kaiserhealthnews.org/Stories/2013/August/02/readmission-penalties-medicare-hospitalsyear-two.aspx. Accessed July 28, 2014.
4. Morse AN, Schroeder CB, Magrina JF, Webb MJ, Wollan PC, Yawn BP. The risk of hydrosalpinx formation and adnexectomy following tubal ligation and subsequent hysterectomy: a historical cohort study. Am J Obstet Gynecol. 2006;194(5):1273–1276.
5. Agresti A. Categorical Data Analysis. 3rd ed. Hoboken, NJ: Wiley; 2013.
6. Jencks SF, Williams MV, Coleman EA. Rehospitalizations among patients in the Medicare fee-for-service program. N Engl J Med. 2009;360(14):1418–1428.
