Research Article

Can statistical linkage of missing variables reduce bias in treatment effect estimates in comparative effectiveness research studies?

Aim: Missing data, particularly missing variables, can create serious analytic challenges in observational comparative effectiveness research studies. Statistical linkage of datasets is a potential method for incorporating missing variables. Prior studies have focused upon the bias introduced by imperfect linkage.

Methods: This analysis uses a case study of hepatitis C patients to estimate the net effect of statistical linkage on bias, also accounting for the potential reduction in missing variable bias.

Results: The results show that statistical linkage can reduce bias while also enabling parameter estimates to be obtained for the formerly missing variables.

Conclusion: The usefulness of statistical linkage will vary depending upon the strength of the correlations of the missing variables with the treatment variable, as well as the outcome variable of interest.

Keywords: claims analysis • comparative effectiveness research • electronic medical records • missing variable bias • retrospective database studies • statistical linkage

10.2217/cer.15.23 © 2015 Future Medicine Ltd

Background

Many of the analytic challenges faced in observational studies can be viewed as missing data problems. An important example is provided by comparative effectiveness research (CER) studies based upon observational data that lack adequate controls for clinical severity [1]. This occurs frequently in studies utilizing de-identified medical claims data. Depending upon the strength of the correlations of the missing variables with the outcome and treatment variables of interest, such missing variables may introduce substantial bias into treatment effect estimates and could cause an effect to be ascribed to treatment that is really due to unobserved variables [2].

Economists, in particular, have developed statistical methods that attempt to control for the bias from missing variables that are correlated with both treatment selection and outcomes. However, these methods hinge upon finding variables that are strongly correlated with treatments but uncorrelated with outcomes. Not surprisingly, this is often challenging and, as a result, the so-called ‘instrumental variables’ are often only weakly correlated with treatment and have a nonzero correlation with outcomes. When this happens, these methods can perform more poorly than ignoring the missing variables altogether [3].

Direct control for missing variables is the preferred solution when possible. In some situations, this can be accomplished by directly linking the missing variables onto the analysis file at the patient level. However, for a variety of reasons this is often not possible. For example, a research dataset, such as a commercial or Medicare claims dataset, has often been de-identified before being sent to the researcher. This makes the exact match of patients across datasets impossible. An alternative is to impute the missing variables through statistical linkage across datasets. Statistical linkage can reduce bias by introducing control of formerly missing variables or, alternatively, increase bias through errors introduced by imperfect matching. Most prior studies of statistical linkage have focused upon the potential bias introduced through imperfect linkage rather than the potential for reducing missing variable bias.
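The dependence of missing variable bias on both correlations can be made concrete with the textbook linear-regression case. The block below is only an illustrative sketch: the symbols (Y for the outcome, T for treatment, Z for the omitted control such as a severity measure) are notation introduced here, and the result holds exactly for ordinary least squares, not for the count models estimated later in this paper, where it serves as intuition only.

```latex
% Assumed linear outcome model (illustrative notation, not from the paper)
Y = \alpha + \beta T + \gamma Z + \varepsilon,
\qquad \operatorname{Cov}(T,\varepsilon) = \operatorname{Cov}(Z,\varepsilon) = 0 .

% Regressing Y on T alone (Z omitted) yields, in large samples,
\operatorname{plim}\,\hat{\beta}_{\text{naive}}
  = \frac{\operatorname{Cov}(Y,T)}{\operatorname{Var}(T)}
  = \beta + \gamma\,\frac{\operatorname{Cov}(T,Z)}{\operatorname{Var}(T)} .

% The bias term vanishes if Z does not affect the outcome (\gamma = 0)
% or if Z is uncorrelated with treatment (\operatorname{Cov}(T,Z) = 0).
```

In this simplified setting the bias is the product of the effect of the omitted variable on the outcome and its association with treatment, which is why an omitted variable matters for the treatment effect only when it is related to both.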

J. Comp. Eff. Res. (2015) 4(5), 455–463

William Crown*,1, Jessica Chang1, Melvin Olson2, Kristijan Kahler3, Jason Swindle4, Paul Buzinec5, Nilay Shah6 & Bijan Borah7
1 Optum Labs, One Main Street, 10th Floor, Cambridge, MA 02142, USA
2 Global Head HEOR Excellence, Novartis Pharma AG, 4056, Basel, Switzerland
3 Novartis Pharmaceuticals Corporation, One Health Plaza, East Hanover, NJ 07936-1080, USA
4 Health Economics & Outcomes Research Optum, Inc., 200 E Randolph, Suite 5300, IL, 60601, USA
5 Health Economics & Outcomes Research MN002-0258, 12125 Technology Drive, Eden Prairie, MN 55344, USA
6 Division of Health Care Policy & Research, Mayo Clinic, 200 First St SW, Rochester, MN 55905, USA
7 Mayo College of Medicine, Division of Healthcare & Medicine, 200 First St SW, Rochester, MN 55905, USA
*Author for correspondence: william.crown@optum.com

ISSN 2042-6305

For example, Tromp et al. address the concern of dependency among linking variables through naive and nonnaive approaches [4]. Their empirical results show that if enough variables are used during the linking process, the prevalence of mismatch is negligible, although a smaller number of linking variables can yield higher mismatch rates. Tromp et al. propose a method to address dependency between linking variables by generating another variable that places numerical scales on agreement and disagreement between two dependent variables [4]. Linking errors can take the form of either false positives or false negatives. Neter, Maynes and Ramanathan demonstrate that seemingly small error rates can introduce bias into statistical parameters [5]. A number of other studies have also examined the bias introduced by mismatch in record linkages [6–8]. However, the literature is largely silent on the potential reduction of omitted variable bias through statistical linkage. This is surprising since, presumably, this is a major reason for doing the linkage in the first place.

In this paper, we focus on the net effect of record linkage on bias, illustrated with an empirical example based upon a dataset of patients treated for the hepatitis C virus (HCV). The approach is to start with a sample where administrative (medical and pharmacy) claims and clinical data are directly linked. This is used to generate a baseline average treatment effect estimate which, for the purposes of illustration, we use to represent the coefficient measuring the true treatment effect. We then drop the clinical variables, representing the case where the researcher has access only to administrative claims. The difference in treatment effect estimates in these two models is our estimate of omitted variable bias. Finally, we statistically link the datasets using a methodology known as Fellegi–Sunter record linkage [9]. This represents the case where the researcher does not have direct access to patient-level linking variables and uses statistical linkage as the method for imputing the missing variables. The resulting dataset is used to examine whether statistical linkage helps to reduce the omitted variable bias, despite the potential for introducing bias through imperfect linkage. Details on the linkage methodology are provided in the Supplementary Materials.

Methods

We illustrate the net effects of statistical linkage on bias in treatment effect estimates in a sample of patients with HCV. HCV is a blood-borne virus that causes slowly progressive chronic liver disease [10]. During 1999–2002, an estimated 2.7–3.9 million individuals in the USA had chronic HCV infection [11]. Prior to May 2011, the standard of care for HCV treatment included a dual therapy regimen of pegylated IFN plus ribavirin [12]. To investigate the impact of statistical linkage on missing variable bias, we estimated the average treatment effect associated with initiating one of two dual therapy regimens (pegylated IFN-α-2a plus ribavirin and pegylated IFN-α-2b plus ribavirin) among patients with HCV, controlling for severity of liver fibrosis as measured by the aspartate aminotransferase platelet ratio (APRI) value. APRI values are not normally available in claims data and would represent an important missing variable affecting bias in the estimated treatment effects if APRI is correlated with both treatment and outcomes. Statistical linkage was used to append the APRI variable to the claims. Average treatment effects were measured by the coefficient estimate for the dual therapy treatment variable.

The retrospective observational study was conducted using administrative claims data, eligibility information and laboratory results for the period of 1 January 2002 through 31 May 2011, within a large USA managed care health plan affiliated with Optum, Inc. The individuals covered by this health plan are geographically diverse across the USA, with greatest representation in the south and midwest US census regions. The plan provides fully insured coverage for professional (e.g., physician), facility (e.g., hospital) and outpatient prescription medication services. Medical (professional, facility) claims include International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM) diagnosis codes, ICD-9-CM procedure codes, Healthcare Common Procedure Coding System (HCPCS) codes, place of service codes, provider specialty codes and health plan and patient costs. Outpatient pharmacy claims provide National Drug Codes for dispensed medications, quantity dispensed, drug strength, days' supply, provider specialty code and health plan and patient costs.
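The statistical linkage step used to append APRI follows the Fellegi–Sunter approach [9], whose details are in the Supplementary Materials. The following is a minimal sketch of the general idea only, not the authors' implementation: each candidate record pair receives a summed log-likelihood-ratio weight over linking fields, assuming the fields agree or disagree independently of one another. The field names, the m- and u-probabilities and the two thresholds are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical linking fields with assumed probabilities:
#   m = P(field agrees | records truly match)
#   u = P(field agrees | records do not match)
FIELDS = {
    "birth_year": (0.95, 0.02),
    "gender": (0.98, 0.50),
    "region": (0.90, 0.25),
}


def match_weight(rec_a, rec_b, fields=FIELDS):
    """Sum per-field log2 likelihood ratios (conditional independence assumed)."""
    total = 0.0
    for name, (m, u) in fields.items():
        if rec_a[name] == rec_b[name]:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


def classify(weight, lower=0.0, upper=5.0):
    """Assign a pair to link / possible link / non-link using two illustrative cutoffs."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


# Illustrative records: one claims record and two candidate laboratory records.
claim = {"birth_year": 1960, "gender": "F", "region": "South"}
lab_same = {"birth_year": 1960, "gender": "F", "region": "South"}
lab_diff = {"birth_year": 1971, "gender": "F", "region": "Midwest"}

w_same = match_weight(claim, lab_same)
w_diff = match_weight(claim, lab_diff)
print(classify(w_same), classify(w_diff))
```

Because the weights are simply added across fields, correlated linking variables would be double counted, which is the conditional independence concern raised in the record linkage literature cited above.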
All study data were accessed using techniques that are in compliance with the Health Insurance Portability and Accountability Act of 1996 (HIPAA), and no identifiable protected health information was extracted during the course of the study. Because this study involved analysis of pre-existing, de-identified data, it was exempt from Institutional Review Board approval.

In addition to medical claims for outpatient visits, inpatient visits, pharmaceutical prescriptions and enrollment information, the analytic file contained APRI values measuring liver fibrosis severity calculated from merged laboratory results data. These data were available for 2031 patients. Details on the sample selection criteria are provided in Supplementary Figure 1.

In order to assess bias tradeoffs, we conducted bootstrapped replications in which we assumed that the average treatment effect estimate obtained from this sample of patients represented the ‘true’ population effect. We then drew 500 random samples from this ‘population’ of sizes n = 500 and n = 1000, respectively. Treatment effects were estimated for each sample using negative binomial count models but dropping the APRI variable (treating it as missing). This enabled us to create an empirical sampling distribution for the regression estimator when the APRI variable was missing and measure both its variance and bias, as well as how these varied by sample size.

Next, we introduced an APRI control variable through statistical linkage, again using the same samples used in the previous constructions of the empirical sampling distributions for the regression estimator. Our hypothesis was that the estimator using statistically linked APRI should be less biased than the regression estimator with APRI treated as a missing variable. However, bias can also be introduced as part of the imputation itself. Consequently, the net impact on bias is a function of both potential bias reduction and bias creation.

We hypothesized that statistical linkage of missing variables would reduce the missing variable bias in the average treatment effect (i.e., move it back in the direction of the ‘true’ estimate) but that this would come at the cost of larger standard errors for the treatment effect estimates. The magnitude of the efficiency loss would be a function of the strength of the covariances among the outcome variable, the treatment variable (i.e., dual therapy pegylated IFN plus ribavirin) and the linked APRI value. In addition, the statistical linkage could, itself, introduce bias. Thus, the bootstrapping approach adopted in our study measures the net effect of statistical linkage of missing variables. Sample size, in turn, would not affect the bias of treatment effect estimates but would influence standard errors [3]. That is, larger sample sizes would reduce standard errors, making the bias more apparent because the sampling distributions of the different estimators would overlap less.

Outcome measures, covariates & multivariate methods

We examined the impact of statistical linkage on bias reduction in models of two economic outcomes: all-cause ambulatory (including office and outpatient) visits and HCV-related ambulatory visits. Following the results of specification tests for count models, these equations were estimated using negative binomial count models [13]. Control variables from the medical claims included patient age, Quan-Charlson comorbidity score, gender, race/ethnicity, education level, household income and geographic location (region). From the laboratory results we were also able to calculate baseline APRI values. This was the key clinical variable that was used to examine the impact of statistical linkage on bias reduction in the various models.

All-cause ambulatory and HCV-related ambulatory visits were measures of healthcare service utilization which were observable for all 2031 patients with HCV. We hypothesized that effective HCV-related treatment would be associated with reduced HCV-related ambulatory visits. Similarly, we hypothesized that healthcare utilization associated with HCV comorbidities or somatic complaints associated with HCV but not directly coded as such would be captured in the all-cause ambulatory care variable. Consequently, we hypothesized that effective HCV treatment would be correlated with a reduction in all-cause ambulatory visits. The Quan-Charlson comorbidity score was hypothesized to be positively correlated with both utilization measures, but there were no prior hypotheses regarding the direction of impact of socio-demographic variables on HCV and all-cause ambulatory visits.

Results

For the purposes of this analysis, the study sample of 2031 individuals with HCV and APRI values is considered to be the true population. Table 1 reports the treatment effect estimates for the ‘dual therapy regimen type at initiation’ covariate across two outcome variables and three sets of assumptions regarding the availability of the APRI variable.
The first column includes the APRI values calculated for patients where it was possible to calculate their actual values based upon laboratory results that were linkable to the claims data for these same individuals. Column 2, denoted as the naive estimator, reports the treatment effects for each of the models dropping APRI altogether. This is analogous to the situation where claims data alone are analyzed without controlling for potentially important clinical variables that are unavailable in the claims data. The difference between column 1 and column 2, shown in column 3, is the estimate of bias from the omitted APRI variable. In column 4, we present the treatment effect estimates including linked APRI values. We compared these estimates with the ‘true’ population values in column 1 (column 5) and then compared this difference with the absolute bias in column 3 (column 6).

Table 1. Treatment effect estimates, significance and bias by model type.

| Dependent variable | ‘True’ estimator (1) | Naive estimator (2) | Absolute bias from missing APRI (3) | Linked APRI (4) | Absolute bias with linked APRI (5) | Bias reduction (%) (6) |
|---|---|---|---|---|---|---|
| All-cause ambulatory visits | -0.0849 (0.0084) | -0.0949 (0.0032) | 0.010024 | -0.0925 (0.0001) | 0.0076 | 23.79 |
| HCV-related ambulatory visits | -0.1046 (0.0041) | -0.1157 (0.0016) | 0.011052 | -0.107 (0.0001) | 0.0024 | 78.8 |

Numbers in brackets represent p-values. APRI: Aspartate aminotransferase platelet ratio; HCV: Hepatitis C virus.

One important point to note when interpreting Table 1 is that missing variables will not introduce bias into treatment effects unless the missing variable is correlated with both treatment and outcomes. As shown in the correlation matrix (Table 2), APRI value had a correlation of 0.072 with all-cause ambulatory visits and 0.112 with HCV-related ambulatory visits. The correlation of APRI value with treatment was -0.035. It is difficult to judge, on the basis of these simple correlations, how serious the omission of APRI might be for introducing bias in the treatment effect estimator. Of course, if the researcher is interested in the independent effects of APRI on all-cause and HCV-related ambulatory visits, this would be reason enough for seeking to include the variable through some strategy such as statistical linkage.

Table 2. Correlation coefficients for key variables.

| | All-cause ambulatory visits (1) | HCV-related ambulatory visits (2) | APRI value (3) | Treatment (4) |
|---|---|---|---|---|
| All-cause ambulatory visits | 1.0000 | | | |
| HCV-related ambulatory visits | 0.65728 | 1.0000 | | |
| APRI value | 0.07229 | 0.11227 | 1.0000 | |
| Treatment | -0.01572 | -0.03142 | -0.03482 | 1.0000 |

Blank cells indicate redundant entries. APRI: Aspartate aminotransferase platelet ratio; HCV: Hepatitis C virus.

Figure 1 displays the sampling distributions of the treatment effect estimators with statistically linked APRI values for the all-cause and HCV-related ambulatory visit models using bootstrapped samples of 500 and 1000 patients, respectively. All of the estimators remained slightly biased. However, in all cases inclusion of the statistically linked APRI variable reduced the bias that would have been present if the variable had been omitted altogether. Note that sample size had no effect on bias reduction but, as expected, larger samples resulted in estimators with smaller standard errors.
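The bias columns of Table 1 follow from simple arithmetic on the coefficient estimates. The sketch below recomputes them from the rounded coefficients printed in the table; the dictionary keys and function name are illustrative. Because the inputs are rounded, the recomputed reductions land close to, but not exactly on, the published 23.79% and 78.8%, which were presumably calculated from unrounded estimates.

```python
# Coefficients as printed in Table 1 (rounded values).
TABLE1 = {
    "all_cause": {"true": -0.0849, "naive": -0.0949, "linked": -0.0925},
    "hcv_related": {"true": -0.1046, "naive": -0.1157, "linked": -0.107},
}


def bias_summary(row):
    """Return (bias without APRI, bias with linked APRI, % bias reduction)."""
    bias_naive = abs(row["naive"] - row["true"])    # column 3
    bias_linked = abs(row["linked"] - row["true"])  # column 5
    reduction_pct = 100 * (bias_naive - bias_linked) / bias_naive  # column 6
    return bias_naive, bias_linked, reduction_pct


summary = {name: bias_summary(row) for name, row in TABLE1.items()}
for name, (b_naive, b_linked, pct) in summary.items():
    print(f"{name}: naive bias {b_naive:.4f}, linked bias {b_linked:.4f}, "
          f"reduction {pct:.1f}%")
```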

Figure 1. Estimates of bias in statistically linked aspartate aminotransferase platelet ratio scores for all-cause ambulatory visits and hepatitis C. [Two panels, ‘Treatment effect: all-cause ambulatory visits’ and ‘Treatment effect: HCV ambulatory visits’, each plotting probability against coefficient estimates for bootstrapped estimates with n = 500 and n = 1000.] HCV: Hepatitis C virus.
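The bootstrap comparison summarized in Figure 1 can be mimicked on synthetic data. The sketch below is not the study's analysis: it swaps the negative binomial count models for ordinary least squares so that it runs with the standard library alone, it generates an artificial population in which a severity variable z affects both treatment and outcome, and it imitates imperfect linkage by replacing z with an unrelated value for an assumed 20% of records. The population size, effect sizes, mismatch rate and number of replications (200 rather than 500) are all assumptions made for illustration.

```python
import random

rng = random.Random(0)


def mean(xs):
    return sum(xs) / len(xs)


def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


def effect(y, t, z=None):
    """OLS coefficient on treatment t, optionally adjusting for one control z."""
    if z is None:
        return cov(y, t) / cov(t, t)
    s_tz = cov(t, z)
    num = cov(y, t) * cov(z, z) - cov(y, z) * s_tz
    den = cov(t, t) * cov(z, z) - s_tz ** 2
    return num / den


# Synthetic 'population': z confounds treatment and outcome (assumed values).
population = []
for _ in range(2000):
    z = rng.gauss(0, 1)
    t = 1.0 if z + rng.gauss(0, 1) > 0 else 0.0
    y = -0.1 * t + 0.5 * z + rng.gauss(0, 1)
    # Imperfect linkage: 80% of records receive their own z, the rest a wrong one.
    z_linked = z if rng.random() < 0.8 else rng.gauss(0, 1)
    population.append((y, t, z, z_linked))


def fit_all(rows):
    """Treatment effect with true z, with z omitted, and with linked z."""
    y, t, z, zl = (list(col) for col in zip(*rows))
    return effect(y, t, z), effect(y, t), effect(y, t, zl)


# As in the paper's design, the full-sample estimate is treated as the 'truth'.
truth = fit_all(population)[0]


def bootstrap_bias(n, reps=200):
    """Mean bootstrap estimate minus the 'truth' for each of the three models."""
    draws = [fit_all(rng.choices(population, k=n)) for _ in range(reps)]
    return tuple(mean([d[i] for d in draws]) - truth for i in range(3))


results = {n: bootstrap_bias(n) for n in (500, 1000)}
for n, (b_true, b_naive, b_linked) in results.items():
    print(n, round(b_true, 3), round(b_naive, 3), round(b_linked, 3))
```

In this artificial setting the linked control removes part, but not all, of the omitted variable bias, which is the qualitative pattern the paper reports; the magnitudes here depend entirely on the assumed parameters.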


Table 3. Aspartate aminotransferase platelet ratio coefficient.

| Dependent variable | ‘True’ APRI estimator (1) | Bootstrapped APRI, n = 500 (2) | Bootstrapped APRI, n = 1000 (3) |
|---|---|---|---|
| HCV ambulatory visits | 0.452865 (0.11735) | 0.41886 (0.79109) | 0.45334 (0.50346) |
| All-cause ambulatory visits | 0.205329 (0.09893) | 0.23041 (0.71747) | 0.27701 (0.41147) |

APRI: Aspartate aminotransferase platelet ratio; HCV: Hepatitis C virus.
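Table 3 does not state what the values in parentheses are. If they are assumed to be standard errors (an assumption made here, not something the table confirms), a simple Wald calculation shows how the bootstrapped APRI coefficients can keep similar point estimates while losing statistical significance. The sketch below uses a normal approximation and illustrative key names.

```python
import math

# (coefficient, parenthesized value) as printed in Table 3.
# ASSUMPTION: the parenthesized values are standard errors.
TABLE3 = {
    ("hcv", "true"): (0.452865, 0.11735),
    ("hcv", "boot_500"): (0.41886, 0.79109),
    ("hcv", "boot_1000"): (0.45334, 0.50346),
    ("all_cause", "true"): (0.205329, 0.09893),
    ("all_cause", "boot_500"): (0.23041, 0.71747),
    ("all_cause", "boot_1000"): (0.27701, 0.41147),
}


def wald(coef, se):
    """Return the z statistic and two-sided normal p-value for coef / se."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


results_table3 = {key: wald(*vals) for key, vals in TABLE3.items()}
for (outcome, model), (z, p) in results_table3.items():
    print(f"{outcome:10s} {model:10s} z = {z:5.2f}  p = {p:.4f}")
```

Under that assumption, only the directly linked (‘true’) models give p-values below 0.05, and the parenthesized values shrink as the bootstrap sample size grows from 500 to 1000.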

When APRI values were linked for patients with evidence of at least one ambulatory visit, very similar treatment effect estimates were obtained to those of the ‘true’ population model. Moreover, the standard errors for the treatment effect estimates in the ambulatory visit models using linked APRI values were still small and the treatment effect estimates highly significant. These results are a function of the strengths of the correlations among the outcome variable, APRI values and the treatment variable. The higher the covariances among these variables, the greater is the omitted variable bias from the missing control variable (APRI) and, consequently, the greater the potential benefit of reducing the bias through statistical linkage.

Table 3 reports the results for the APRI variable itself in each of the two outcome models under different assumptions regarding the availability of APRI values. As with Table 1, the first column of Table 3 reports estimates of the APRI parameter in models where it was possible to calculate APRI values for patients based upon their laboratory results and it was also possible to link these results to the claims data for these same individuals. Once again, for the purposes of this exercise, we assume that the parameter estimate calculated for APRI in this model is the ‘true’ population estimate. The issue of bias for APRI is, of course, extreme when the variable is completely missing. In effect, the parameter estimate is constrained to be zero in this instance. This is analogous to the situation where claims data alone are analyzed without controlling for potentially important clinical variables that are unavailable in the claims data. In columns 2 and 3, we present the APRI parameter estimates based upon linked APRI values for bootstrapped samples of 500 and 1000 observations, respectively.

Figure 2 reports the empirical sampling distributions for the APRI coefficients in the estimators for all-cause and HCV-related ambulatory visits, respectively. As with the treatment effects discussed earlier, larger sample sizes had no effect on bias but, as expected, did result in coefficient estimates with smaller standard errors.

If the researcher has an interest in an important variable not available in the dataset, its omission from the analysis is obviously a serious issue even if it has no correlation with the treatment variable. APRI was found to be a statistically significant regressor in all-cause and HCV-related ambulatory visits in the original dataset. However, although the statistically linked APRI variable was found to have similar point estimates to the original directly linked models, large standard errors rendered the coefficient estimates statistically insignificant. These results indicate that the value of statistical linkage for imputation varies in different situations. Key factors are the strength of the correlations of missing variables with the treatment and outcome variables, as well as sample size.

Figure 2. Estimates of bias in statistically linked aspartate aminotransferase platelet ratio scores for all-cause ambulatory visits and hepatitis C. [Two panels, ‘APRI coefficient: HCV ambulatory visits’ and ‘APRI coefficient: all-cause ambulatory visits’, each plotting probability against APRI coefficient estimates for bootstrapped estimates with n = 500 and n = 1000.] APRI: Aspartate aminotransferase platelet ratio; HCV: Hepatitis C virus.

Discussion

A significant challenge for producing reliable evidence from real world data analyses is overcoming bias from omitted variables. In the particular case of estimating treatment effects, any important variables correlated with both treatment and outcomes that are omitted from the analysis will introduce bias by definition. In this paper, we hypothesize that if one imputes a missing variable from a similar population, it should be possible to reduce the bias relative to omitting the variable completely, although this will likely come at the expense of larger standard errors. Moreover, statistical linkage of missing variables also enables the independent effects of these variables on outcomes to be assessed.

The issue of missing variables is one that arises commonly in CER studies. For example, studies using claims data alone lack clinical control variables. For conditions where it is particularly important to control for disease severity or have clinical data to measure comorbidities (e.g., BMI as a control variable for studies of acute myocardial infarction risk), imputing missing variables through statistical linkage can help to reduce the bias that would otherwise be introduced by the missing variables.
The results indicate that the signs and statistical significance of the linked APRI variable itself were broadly similar to results obtained with the actual APRI values. The ability of the APRI variable to control for bias in the treatment variable was related to the strength of its correlations with treatment choice and outcomes. In models where APRI was more highly correlated with treatment (all-cause and HCV-related ambulatory visits), linked APRI reduced bias in the treatment effect estimates. However, because in this particular case APRI value was not highly correlated with treatment, the bias introduced by completely omitting APRI value was small to begin with. In instances where the omitted variable or variables are more highly correlated with treatment, the benefits of statistical linkage would be potentially greater.

In this analysis we had the luxury of having actual APRI values for all patients. In actual practice, the sample from which the missing variables are drawn is likely to be somewhat (or perhaps very) different from the sample to which they are linked. This is just one of many issues related to statistical linkage that may introduce measurement error into the model. Nonrandom measurement error may introduce a correlation between the treatment variable and the error term of the outcome equation. Bias will be reduced by statistical linkage only if this correlation is smaller than the correlation between the treatment variable and the error term in the model omitting APRI altogether.

We suspect that the incorporation of missing variables through statistical linkage might be viewed as a lower strength of evidence by traditional evidence hierarchies. Although this would be true for comparable datasets with actual versus imputed data, the importance of addressing bias when important control variables are missing is another matter. Here, the extensive literature on imputation demonstrates that incorporation of missing variables through imputation (particularly multiple imputation) can reduce bias relative to ignoring the missing variables altogether.

Limitations

Applying the Fellegi–Sunter method of record linkage raises some potential limitations that arise from assumptions of the method that may not hold in reallife applications of health data. The main assumption often violated in health data record linkages is conditional independence between the linking variables. This implies that any agreement or disagreement on one linking variable should not have any effect in the probability that another linking variable agrees or disagrees. In practice, this assumption often does not hold, particularly in health records linkage. Correlated variables are inherent in health records ranging from electronic medical records to claims data; this is especially true in studies of treatment effects of therapy or drug regimen. One of the primary issues that dependent linking variable raises is the introduction of additional bias in the linked records and statistical analysis. Much of the literature on imputation and bias reduction has focused upon multiple imputation [14] . We did not create multiple imputed datasets through statistical linkage because of the additional complexity that this would introduce into our bootstrap simulations. The specific findings of this study have the usual limitations common to analyses conducted with data from administrative claims. This study is limited to the degree to which claims data can accurately capture an individual’s medical history. Claims data are collected for payment and not for research. While these

future science group

Can statistical linkage of missing variables reduce bias in treatment effect estimates in CER studies? 

Research Article

Executive summary Background • Many of the analytic challenges faced in observational comparative effective research studies can be viewed as missing data problems. Depending upon the strength of correlations among missing variables, the outcome and treatment variables of interest, missing variables may introduce substantial bias into treatment effect estimates. • Direct control for the missing variables is the preferred solution. In some situations, this can be accomplished by directly linking the missing variables onto the analysis file at the patient level. However, for a variety of reasons this is often not possible. An alternative is to impute the missing variables through statistical linkage across datasets. • Most prior studies of statistical linkage have focused upon the potential bias introduced through imperfect linkage rather than the potential for reducing missing variable bias. In this paper, we focus on the net effect of record linkage on bias illustrated with an empirical example based upon a dataset of patients treated for the hepatitis C virus (HCV).

Methods • To investigate the impact of statistical linkage on missing variable bias, we estimated treatment of patients with HCV with one of two dual therapy regimens (pegylated IFN-α-2a plus ribavirin and pegylated IFN-α-2b plus ribavirin), controlling for severity of liver fibrosis as measured by the aspartate aminotransferase platelet ratio (APRI) value. • The study was conducted using medical claims, pharmacy claims, and eligibility information for the period of 1 January 2002 to 31 May 2011, within a large USA managed care health plan affiliated with Optum, Inc. In addition to medical claims for outpatient visits, inpatient visits, pharmaceutical prescriptions, and enrollment information, the analytic file contained APRI values measuring liver fibrosis severity calculated from merged laboratory results data. These data were available for 2031 patients. • We assumed that the average treatment effect estimate obtained from the linked sample of 2031 patients represented the ‘true’ population effect. We then drew 500 random samples from this ‘population’ of sizes n = 500 and n = 1000, respectively. Treatment effects are estimated for each sample using negative binomial count models but dropping the APRI variable (treating it as missing). This enabled us to create an empirical sampling distribution for the regression estimator when the APRI variable is missing and measure both its variance and its bias, as well as how these vary by sample size. • Next, we introduced an APRI control variable through statistical linkage-again using the same samples used in the previous constructions of the empirical sampling distributions for the regression estimator. Our hypothesis was that statistical linkage should result in less biased estimates than the regression estimate using claims alone (treating APRI treated as a missing variable). However, bias can also be introduced as part of the imputation itself. 
Consequently, the net impact on bias is a function of both potential bias reduction and bias creation.

Results • The results show that statistically linking APRI reduced treatment effect bias that was present when using claims data alone. However, the imputed APRI variable was not statistically significant in either model of allcause or HCV-related ambulatory visits. In instances where the omitted variable or variables are more highly correlated with treatment, the benefits of statistical linkage would be potentially greater.

Discussion • A significant challenge for producing reliable evidence from real world data analyses is overcoming bias from omitted variables. In the particular case of estimating treatment effects, any important variables correlated with both treatment and outcomes that are omitted from the analysis will introduce bias by definition. In this paper, we hypothesize that if one imputes a missing variable from a similar population, it should be possible to reduce the bias relative to omitting the variable completely-although this will likely come at the expense of larger standard errors. Moreover, statistical linkage of missing variables also enables the independent effects of these variables on outcomes to be assessed. • Missing data are a common problem in observational data analysis-particularly studies relying upon convenience samples from retrospective claims or electronic medical record databases. To the extent that missing variables are correlated with both treatment selection and outcomes their omission will result in biased estimates of treatment effect parameters. In this paper we demonstrate the conditions under which it may be possible to obtain unbiased treatment effect estimates in situations where data on important control variables are partially, or perhaps even completely missing, but have been linked. Using HCV as an empirical example, we find that using linked APRI values enabled us to reduce omitted variable bias in models of allcause and HCV-related ambulatory visits. Additional research is warranted assessing the generalizability of these methods to other datasets, for example imputing multiple variables from electronic medical records into claims datasets, or imputing claims variables into clinical trial datasets.


Research Article  Crown, Chang, Olson et al.

data are excellent for understanding ‘real world’ patterns of healthcare utilization, they are subject to possible coding errors, coding for the purpose of rule-out rather than actual disease, and undercoding. Indeed, the central objective of the study was to address the fact that key clinical variables are often not readily available in claims data, and that omitting them can bias the results. The data used for this study come from a commercially insured managed care population; therefore, the results of this analysis are primarily applicable to the treatment of chronic HCV with dual therapy (pegylated interferon plus ribavirin) in managed care settings and may not be applicable to patients in nonmanaged care settings. This issue is less of a concern given the methodological thrust of the current paper. In other applications of statistical linkage to address missing variables, however, one would need to assume that the population from which the donor samples were drawn was similar to the recipient population. For example, if the donor population was quite different (e.g., medical records from the Veterans Administration linked to the medical claims of a commercially insured population), it is possible that measurement error from statistical linkage could introduce more bias than the omitted variables did originally. As a result, considerable care needs to be given to selecting donor and recipient samples that are as similar as possible.

Conclusion
Missing data are a common problem in observational data analysis, particularly in studies relying upon convenience samples from retrospective claims or electronic medical record databases. To the extent that missing variables are correlated with both treatment selection and outcomes, their omission will result in biased estimates of treatment effect parameters. In this paper, we demonstrate the conditions under which it may be possible to obtain unbiased treatment effect estimates in situations where data on important control variables are partially, or perhaps even completely, missing but have been linked. Using HCV as an empirical example, we find that using linked APRI values enabled us to reduce omitted variable bias in situations where APRI values were more highly correlated with outcomes (all-cause and HCV-related ambulatory visits), and to estimate the effect of APRI on outcomes directly. Additional research is warranted to assess the generalizability of these methods to other datasets, for example, imputing multiple variables from electronic medical records into claims datasets, or imputing claims variables into clinical trial datasets.

Supplementary data
To view the supplementary data that accompany this paper please visit the journal website at: www.futuremedicine.com/doi/full/10.2217/cer.15.23

Financial & competing interests disclosure
The authors have no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties. No writing assistance was utilized in the production of this manuscript.

References
1  Viswanathan M, Ansari MT, Berkman ND et al. Assessing the risk of bias of individual studies in systematic reviews of health care interventions. Agency for Healthcare Research and Quality Methods Guide for Comparative Effectiveness Reviews. AHRQ Methods for Effective Health Care. Publication No. 12-EHC047-EF. www.effectivehealthcare.ahrq.gov/
2  Crown W, Obenchain R, Englehart L, Lair T, Buesching D, Croghan T. Application of sample selection models to outcomes research: the case of evaluating effects of antidepressant therapy on resource utilization. Stat. Med. 17(17), 1943–1958 (1998).
3  Crown W, Henk H, Vanness D. Some cautions on the use of instrumental variables estimators in outcomes research: how bias in instrumental variables estimators is affected by instrument strength, instrument contamination, and sample size. Value Health 14(8), 1078–1084 (2011).
4  Tromp M, Meray N, Ravelli AC, Reitsma JB, Bonsel GJ. Ignoring dependency between linking variables and its impact on the outcome of probabilistic record linkage studies. J. Am. Med. Inform. Assoc. 15(5), 654–660 (2008).
5  Neter J, Maynes SE, Ramanathan R. The effect of mismatching on the measurement of response error. J. Am. Stat. Assoc. 60(312), 1005–1027 (1965).
6  Larsen MD, Lahiri P. Regression analysis with linked data. J. Am. Stat. Assoc. 100(469), 222–230 (2005).
7  Scheuren F, Winkler WE. Regression analysis of data that are computer matched – part I. Surv. Methodol. 19(1), 39–58 (1993).
8  Scheuren F, Winkler WE. Regression analysis of data that are computer matched – part II. Surv. Methodol. 23, 157–165 (1997).
9  Fellegi IP, Sunter AB. A theory for record linkage. J. Am. Stat. Assoc. 64, 1183–1210 (1969).
10  McHutchison JG, Bacon B. Chronic hepatitis C: an age wave of disease burden. Am. J. Manag. Care 11(10), s286–s295 (2005).
11  Armstrong GL, Wasley A, Simard EP, McQuillan GM, Kuhnert WL, Alter MJ. The prevalence of hepatitis C virus infection in the United States, 1999 through 2002. Ann. Intern. Med. 144(10), 705–714 (2006).
12  McHutchison JG, Lawitz EJ, Shiffman ML et al. Peginterferon alfa-2b or alfa-2a with ribavirin for treatment of hepatitis C infection. N. Engl. J. Med. 361(6), 580–593 (2009).
13  Cameron AC, Trivedi P. Regression Analysis of Count Data (2nd Edition). Cambridge University Press, Cambridge, UK (2013).
14  Rubin DB. Multiple Imputation for Nonresponse in Surveys. Wiley, NY, USA (1987).

J. Comp. Eff. Res. (2015) 4(5)
