REVIEW

Evaluating methodological assumptions in comparative effectiveness research: overcoming pitfalls

The scope of comparative effectiveness research (CER) is wide and therefore requires the application of complex statistical tools and nonstandard procedures. The commonly used methods presuppose the realization of important, and often untestable, assumptions pertaining to the underlying distribution, study heterogeneity and targeted population. Accordingly, the value of the results obtained based on such tools is in large part dependent on the validity of the underlying assumptions relating to the operating characteristics of the procedures. In this article, we elucidate some of the pitfalls that may arise with use of the most commonly used techniques, including those that are applied in network meta-analysis, observational data analysis and patient-reported outcome evaluation. In addition, reference is made to the impact of data quality and database heterogeneity on the performance of commonly used CER tools and the need for standards in order to inform researchers engaged in CER.

Demissie Alemayehu1 & Joseph C Cappelleri*2
1Pfizer Inc., 219 East 42nd Street, New York, NY 10017, USA
2Pfizer Inc., 445 Eastern Point Road, MS 8260-2502, Groton, CT 06340, USA
*Author for correspondence: [email protected]

KEYWORDS: bias; comparative effectiveness; indirect treatment comparison; model assumption; model validation; network meta-analysis; observational studies; patient-reported outcomes; robustness

According to a definition by a committee of the Institute of Medicine (IOM), comparative effectiveness research (CER) pertains to “the generation and synthesis of evidence that compares the benefits and harms of alternative methods,” with the purpose “to assist consumers, clinicians, purchasers and policy-makers to make informed decisions that will improve healthcare at both the individual and population levels” [1]. The synthesis of evidence from systematic reviews naturally requires complex statistical tools in order to analyze and summarize all of the available and relevant information generated through different sources, including randomized clinical trials (RCTs), observational studies and other forms of data collection. In particular, the emphasis on ‘effectiveness’ (does the intervention work in practice?), rather than ‘efficacy’ (can the intervention work in controlled settings?) underscores the importance of comparing alternative treatment technologies using a much wider net of evidence than just RCTs [2]. Furthermore, the reference to ‘alternative methods’ implies that the scope of CER is broad enough to canvass all relevant interventions used in the standard of care. Explicit in the definition is also the focus on both ‘the individual and population levels’. Accordingly, while evidence generation in traditional RCTs relates to the population level, the heightened interest at the ‘individual’ level requires special data sources, such as patient-reported outcome (PRO) measures, which can be captured at the individual level as well as at the population level. PRO questionnaires provide a standardized way of capturing patient perspectives and experiences, allow outcomes that patients care about to be assessed and complement other clinical outcomes that, when taken together, form an overall comprehensive evaluation of patient care.

The impact of CER depends on the extent to which convincing data and methods can be applied to the right question at the right time [3]. CER can be performed using a broad range of established and emerging methods, including: systematic reviews of existing research; decision modeling; retrospective analysis of existing clinical or administrative data; prospective observational studies, including registries, which (similar to retrospective studies) observe patterns of care and outcomes based on patients self-selecting themselves to treatment groups; and experimental studies, including RCTs. In turn, this portfolio of CER methods needs to be placed in the context of meaningful involvement of all stakeholders (including patients, consumers, clinicians, payers and policy-makers), development of best practices and improvements in research infrastructure in order to enhance the validity and efficiency with which CER studies are implemented. These contextual virtues aim to help healthcare decision-makers execute informed decisions at the level of individual care for patients and at the policy level for payers and other policy-makers. In order to be useful for decision-making, however, such evidence generated through CER must be valid, relevant, timely, feasible and actionable. Given the wide scope and ambitious intent of CER, it is often essential to apply nonstandard statistical tools and procedures. Similar to most statistical tools, the methods commonly used in CER rely on important, and often untestable, assumptions pertaining to the underlying distribution of the data, study heterogeneity and targeted population. Consequently, the value of results obtained based on such tools is in large part dependent on the validity of the underlying assumptions relating to the operating characteristics of the procedures. While there are numerous initiatives to formulate guidelines and best practices for methodological choice in CER, there is no general framework at present to guide data analysts and other researchers involved in CER activities in the evaluation of tools that are appropriate under varying conditions. In this article, we elucidate some of the pitfalls or concerns that may arise with the implementation of the most commonly used techniques, including those that are applied in network meta-analysis, observational data analysis and PRO evaluation. In addition, reference is made to the impact of data quality and database heterogeneity on the performance of alternative
CER tools and the need for standards to inform researchers engaged in CER. This article is organized as follows. In the 'Indirect comparisons' section, we discuss the assumptions underlying the so-called indirect treatment comparison techniques arising from network meta-analysis and suggest remedial measures in order to minimize the bias emanating from violations of the assumptions. The 'Observational studies' section deals with the issues associated with the analysis of data in observational studies. In the 'Analytical issues with PROs in CER' section, we highlight the importance of PROs in CER and address some of the methodological challenges. In the 'Best practices' section, we outline best practices and give a list of relevant guidelines. Finally, concluding remarks are given in the 'Conclusion' section. While the analytic pitfalls described in this article transcend CER, they also remain integral and fundamental to CER. Major elements of CER and their analytic considerations are captured through the examination of all available treatments (network meta-analysis), real-world effectiveness (observational data analysis) and patient-centered evaluation (PRO analysis).

Indirect comparisons

It is now well recognized that RCTs, and systematic reviews of them, are regarded as the gold standard for generating evidence for healthcare decision-making. However, in the context of CER, which requires comparisons of all relevant interventions, RCTs in which all treatment options are compared directly in a head-to-head fashion are usually the exception rather than the rule. In the absence of trials involving a direct comparison of treatments of interest, alternative approaches have therefore been proposed in order to synthesize data through indirect comparisons (ITCs) [4], while preserving some of the benefits of randomization. In its simplest form, an ITC of two treatments that have only been compared with a common comparator involves the use of the relative effects of each of the two treatments versus the common comparator [5]. In more complicated cases, it may be of interest to combine the results of the direct evidence with those of the indirect estimates, in what is termed a mixed treatment comparison, with the additional benefit of getting more precise estimates [6,7]. This whole process of synthesizing evidence from a network of multiple RCTs, in which
treatments are compared directly or indirectly, involves using a generalization of a traditional meta-analysis based on an approach often referred to as 'network meta-analysis' [8]. An appropriate ITC or mixed treatment comparison is an integrated analysis that formally and correctly adjusts for treatment differences rather than an unadjusted analysis that merely provides a series of independent pairwise analyses for the purpose of comparing treatments that were not compared directly [9]. An exemplary ITC has been published on the comparative efficacy and acceptability of 12 new-generation antidepressants [10]. Despite their intuitively desirable features, the reliability of the results based on ITC methodology is heavily dependent on the validity of the assumptions underlying the construction and application of the technique. Some of the issues are similar to those that arise in the context of traditional meta-analysis, including heterogeneity of treatment effects across trials, publication bias, study quality and investigator bias. There are also additional assumptions required for proper ITC method implementation that may not be testable or verified based on the available data. Heterogeneity of treatment effects among studies within a direct comparison is a fundamental consideration in an ordinary (traditional) meta-analysis, as well as in a network meta-analysis (i.e., ITCs). Such heterogeneity occurs when different patient characteristics or study characteristics (or both types of characteristics) modify the relative treatment effects between a series of trials involving a direct comparison between two interventions. Conventional meta-analysis customarily involves the evaluation of the heterogeneity of effects. In one approach, heterogeneity among studies within a direct comparison is typically built into the model as long as their treatment effects share a common typical value. This heterogeneity is captured through a random-effects model, which allows each individual study to have its own treatment effect, with the distribution of those effects centered around a common value. On the other hand, heterogeneity between the sets of studies that contribute direct comparisons to an ITC in a network meta-analysis would indicate a lack of similarity. The crucial assumption underlying ITCs is that the relative efficacy of a treatment is the same in all trials included in the ITC – that is, the same effect size would be obtained if each one of the direct comparisons was performed under the trial conditions
that generated data for the other direct comparisons. In ITCs, violation of this similarity or exchangeability assumption can lead to biased results [11,12]. There is no direct way of testing the assumption of similarity. However, there are indirect ways of assessing whether distributions of known or measured confounders are comparable across the sets of trials involved in an ITC [11,12]. This may involve both qualitative and quantitative evaluations of the factors that are known to modify the relative treatment effects. When data come from RCTs, the factors may be internal to the individual RCT or external to it. Internal factors include study design parameters, such as blinding, study duration, outcome measures, dosing schedules and patient characteristics. External factors that may impact effect sizes include healthcare policy, location and when the trials were conducted. A qualitative evaluation typically involves an understanding of the subject matter and clinical inspection of the comparability of those known and measured factors across the sets of studies involved in an ITC. A key limitation of the approach is that there is no guarantee that measurements on all potential effect modifiers (e.g., severity of illness and duration of study), both known and unknown, will be available. Nonetheless, if the qualitative assessment suggests an imbalance in some of the known effect modifiers that are measured, efforts should then be made to account for this imbalance. In the absence of patient-level data, this undertaking is often challenging. When only aggregate (study-level) data are available, as is the case in most CER projects involving synthesis of information from the literature and other secondary publications, one recommended quantitative approach is meta-regression, despite its known limitations [13]. Other quantitative approaches include the following: evaluation of the comparability (consistency) of effect estimates for the reference group across trials (as a measure of baseline risk or study-level effect); evaluation of heterogeneity in the original sets of meta-analyses that may have been used to combine all the evidence; and evaluation of tests for homogeneity in subgroups that mimic characteristics of other studies [11,12]. In addition, the qualitative and quantitative exercises may be complemented using simulation techniques [14]. For example, a simulated treatment comparison approach can be created
to incorporate the missing treatment arms into an existing trial [15]. This approach, driven by predictive equations from the completed trial of central interest (the 'index' trial), yields a simulated head-to-head comparison within a trial, estimates the results that would have been obtained in a head-to-head trial and addresses sources of variability among the different trials. Separate data for the comparators are used to calibrate the index equations in order to reflect the alternative interventions. Another assumption is the consistency between the direct evidence and the indirect evidence when both are available [16]. It is important that the results from these two sources are comparable. Inconsistency can be thought of as a conflict between 'direct' evidence on a comparison between treatments B and C and 'indirect' evidence gained from trials that compared (A and C) and (A and B) separately. Similar to heterogeneity, inconsistency may be caused by effect modifiers and specifically by an imbalance in the distribution of effect modifiers in the direct and indirect evidence. If the consistency assumption cannot be validated, or the evidence from the indirect source is not reliable, then a synthesis of the evidence is suspect and open to question. Several techniques have been developed in order to assess consistency under different scenarios [17–19]. The simplest method, referred to as the Bucher method for single loops of evidence, is essentially a two-stage method. The first stage synthesizes the direct evidence and the indirect evidence separately in each pairwise contrast, while the second stage tests statistically whether the direct and indirect evidence are in conflict. However, this method can be applied only to three independent sources of data (e.g., the B–C direct effect is compared with its indirect effect derived from the A–B and A–C direct evidence). Three-arm trials cannot be included because they are internally consistent and will reduce the chance of detecting inconsistency. As a more advanced approach, the back-calculation method can be viewed as an extension of the Bucher approach to networks with multiple loops of any size. This method works well for simple fixed-effect models with no multiarm trials. It can, in principle, be extended to incorporate random-effects models and multiarm trials, but suffers from several technical disadvantages [18]. The method of node splitting overcomes these disadvantages [18,19]. It allows the user to split the information contributing to estimates of a
parameter (node) into a direct component (e.g., based on all the A–B data from A–B, A–B–C and A–B–D trials) and an indirect component based on all of the remaining evidence. A drawback of the node-splitting method is that it can become quite computationally intensive if many nodes need to be split. In addition, when nodes for which indirect information is scarce are split, convergence of the updating algorithm can become slow. Yet another approach for detecting inconsistency is to compare the standard consistency model with a model not assuming consistency using a Bayesian framework [18,19]. However, in general, it is difficult to identify inconsistencies, which may in large part be due to natural variability, especially when associated with the heterogeneity of the patient populations enrolled in the different studies. Furthermore, as is generally the case with sparse data and Bayesian methods, especially when a random-effects analysis is used, insufficient data are likely to mask all but the most obvious signs of inconsistency. It should be emphasized that an ITC investigation should be complemented with thorough sensitivity analyses in order to ensure the robustness of results under alternative scenarios. In addition to performing analyses stratified by potential effect modifiers (e.g., gender, age, severity of illness and duration of study), researchers can evaluate results using alternative model formulations (e.g., random-effects models vs fixed-effects models, which assume a fixed or single common effect across studies for a given pair of treatments). Although frequentist approaches appear to be routinely used in most meta-analysis projects, Bayesian procedures are being increasingly used in relatively complex models, especially those arising in mixed treatment comparisons. Noninformative priors are often postulated for most parameters of interest, including treatment effects. In some cases, the choice of priors may be informed by historical data. In all cases, it is good practice to perform a sensitivity analysis in order to better understand the influence of the choice of the prior on study conclusions. Box 1 includes a series of fundamental questions to consider in the analysis of ITCs from RCTs.

Box 1. Some key questions to ask regarding the analytic assumptions in indirect treatment comparisons.
Homogeneity
■■ Is the method used to determine the presence of statistical heterogeneity adequate?
■■ Is the homogeneity assumption satisfied or is heterogeneity, if present, accounted for?
Similarity
■■ Is the assumption of similarity stated?
■■ Is a reasonable approach used to assess the assumption of similarity?
Consistency
■■ Is consistency of direct effects and indirect effects assessed?
■■ If consistency is reported, is the evidence combined?
■■ If consistency is not reported, is the evidence not combined?
■■ Are patient or trial characteristics between the direct trials and the indirect trials combined?
■■ Are studies with more than two treatment arms correctly analyzed?
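To make these checks concrete, the sketch below works through a minimal, hypothetical single-loop example on the log odds ratio scale: fixed-effect pooling with Cochran's Q and I² for within-comparison heterogeneity, a Bucher-type adjusted indirect estimate via a common comparator, and a simple z-test contrasting the direct and indirect estimates. The study values are invented for illustration, and a real network meta-analysis would normally be fitted with dedicated (often Bayesian) software rather than assembled by hand.

```python
import numpy as np
from scipy import stats

def pool_fixed(est, se):
    """Inverse-variance fixed-effect pooling with Cochran's Q and I^2."""
    est, se = np.asarray(est, float), np.asarray(se, float)
    w = 1.0 / se**2
    pooled = np.sum(w * est) / np.sum(w)
    pooled_se = np.sqrt(1.0 / np.sum(w))
    q = np.sum(w * (est - pooled) ** 2)                 # Cochran's Q
    df = len(est) - 1
    p_het = 1.0 - stats.chi2.cdf(q, df) if df > 0 else np.nan
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, pooled_se, q, p_het, i2

# Hypothetical log odds ratios (each treatment vs the common comparator A) and SEs
d_AB, se_AB, *_ = pool_fixed([-0.35, -0.20, -0.45], [0.15, 0.20, 0.25])  # B vs A trials
d_AC, se_AC, *_ = pool_fixed([-0.60, -0.55], [0.18, 0.22])               # C vs A trials

# Bucher adjusted indirect comparison of C vs B via the common comparator A
d_BC_ind = d_AC - d_AB
se_BC_ind = np.sqrt(se_AC**2 + se_AB**2)

# Hypothetical direct head-to-head evidence for C vs B
d_BC_dir, se_BC_dir = -0.15, 0.20

# Simple z-test for inconsistency between the direct and indirect estimates
diff = d_BC_dir - d_BC_ind
se_diff = np.sqrt(se_BC_dir**2 + se_BC_ind**2)
z = diff / se_diff
p_incons = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"indirect C vs B: {d_BC_ind:.3f} (SE {se_BC_ind:.3f}); "
      f"direct: {d_BC_dir:.3f}; inconsistency p = {p_incons:.3f}")
```

As with any such test, a nonsignificant result does not establish consistency; it may simply reflect limited power, which is why the qualitative similarity assessment described above remains essential.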

Observational studies

A major limitation of RCTs in the context of CER is that the conditions under which they
are conducted may not be reflective of the circumstances under which drugs will be used in a real-world setting. As a consequence of this lack of external validity of RCTs, there is increasing reliance on observational studies to fill the evidentiary gap. Effective use of data from observational studies, however, requires addressing major conceptual and methodological assumptions, as well as infrastructural issues. Such scientific prudence has led to initiatives intended to strengthen the evidentiary value of data from non-RCT sources and the development and implementation of guidelines on, for example, retrospective data analyses [20]. A systematic review of observational studies on medical interventions showed that the quality of reporting on confounding bias was poor, calling into question the credibility and validity of results [21]. On the other hand, an exemplary observational study with high-quality reporting on confounding bias, with appropriate analyses to address it, has been published on primary and facilitated percutaneous coronary intervention [22]. A major shortcoming of observational studies is the uncertainty regarding the comparability of treatment groups with regard to potential confounding factors, both known and unknown. In addition to the effect-modifying confounders between sets of trials outlined in the 'Indirect comparisons' section, a confounder in a single study may be time dependent, producing a type of bias that is often referred to as 'confounding by indication' [23]. Because the allocation of treatment in observational studies is not randomized and the indication for treatment may be related to the risk of future health outcomes (benefits and harms), the resulting imbalance in the underlying risk profile between treated and comparison groups can generate biased
results. Unlike RCTs, which enjoy the benefit of randomization to guard against such problems, in observational studies, measures should be in place to minimize the bias introduced by both measured and unmeasured confounders. Although alternative analytical strategies have been proposed to control or minimize the effects of confounders in observational studies, such as the use of restriction to address confounding [23], there is no method that can guarantee the effective elimination or mitigation of the resultant bias [24]. A traditional approach to handle measured confounders involves matching a treated subject with a control subject with respect to a prespecified set of confounders. Stratification is another approach that requires categorizing relevant variables in order to form strata and performing stratum-specific assessment of the treatment on outcome. If results across strata are similar, it would be appropriate to combine them across strata. Although matching and stratification may have the added advantage of enhancing the precision of estimated effects, uncritical use of these approaches may have undesirable consequences. For example, overmatching on exposure can result in showing no treatment effect when in fact one exists, and having too many strata may result in unstable or unreliable estimates [25]. A related approach to controlling for measured confounders is the use of propensity scores, in which classification of subjects into categories is achieved in a data-driven manner [26,27]. This method generally cannot remove hidden biases, and has recently been shown to be sensitive to how it is implemented in the analytical model. For example, research based on Monte Carlo simulations has shown that stratification on the propensity score and covariate adjustment using
the propensity score typically result in biased estimation of hazard ratios compared with inverse probability of treatment weighting using the propensity score or with propensity score matching when estimating the relative effect of treatment on time-to-event outcomes [28]. In another study, again based on simulations, the results suggest that forcing balance between treated and untreated groups using propensity score analyses may tend to worsen the imbalance in unmeasured covariates [29]. More specifically, the authors argue that if the unmeasured covariates are confounders, then the use of propensity score techniques can exacerbate the bias in treatment-effect estimates. Despite the popularity of the propensity score methodology in the analysis of observational data, it therefore becomes prudent to assess the appropriateness of the approach vis-à-vis alternative techniques, such as conventional multivariate regression models. For instance, propensity score methods allow for the estimation of the marginal treatment effect (i.e., the average effect of the treatment in the study population), while multivariate regression provides estimation of the conditional treatment effect (i.e., the expected effect of the treatment for any individual) [30]. While marginal and conditional estimates of treatment effects for differences in means or in proportions are expected to coincide when models are correctly implemented, such is not expected to be the case for odds ratios or hazard ratios arising from binary or time-to-event outcomes. Thus, the preference between propensity score methods and multivariate regression methods may depend, in part, on the type of inference to be made (marginal vs conditional). Relative to traditional multivariate regression methods, propensity score methods offer potential advantages: they may make it easier to diagnose model specification with respect to confounding, may provide more flexibility and reliability when the treatment is common and the binary or time-to-event outcome is rare and may allow (as in a randomized controlled trial) separation of the study design from the study analysis (the propensity score can be constructed without any reference to the outcome) [30]. On the other hand, multivariate regression methods may be more discerning when the treatment interacts with a covariate on the outcome and for describing the effects of other covariates on the outcome [31]. Both propensity score methods and multiple regression techniques are
subject to the issues encountered with omitted variables and mismeasured variables, as well as with complex survey designs [31]. The method of instrumental variables has been proposed for the purpose of adjusting for both measured and unmeasured confounders [32,33]. However, it is critical to keep in mind that any application of instrumental variables depends strongly on additional assumptions above and beyond those conditions required by ordinary least squares regression – assumptions that are needed for a variable to serve satisfactorily as a credible instrument. The technique relies on the identification of a variable that is highly correlated with treatment while not independently affecting the outcome and, in addition, not being associated with the observed confounders. It is often the case that the most challenging part for the empirical researcher is to argue successfully that the instrument predicts the outcome only indirectly, with its influence passing through the treatment rather than directly from the instrument to the outcome. The arbitrariness of the way a proper instrument is identified or validated has limited the wide application of the procedure [34]. Other approaches for observational studies include marginal structural models and structural equation models [20,24]. A marginal structural model can be applied to generate a pseudopopulation via inverse probability of treatment weighting; it is intended to mimic RCTs and to consistently estimate causal treatment effects in the presence of time-dependent confounders that are themselves affected by treatment [35]. This type of model, however, is based on the strong assumptions of correct model specification and no unmeasured confounders, which can never be fully proven. A structural equation model can be applied to a path analysis linking drugs to medical outcomes via intermediate or mediator variables [36]. This type of model allows, in principle, an assessment of further attributes that goes beyond the mean difference in the dependent (outcome) variable between treatment groups. Nonetheless, such an approach is dependent on the theoretical specification of the model (which variables are included and how they are related) and its accompanying goodness of fit with the data. While effectively executed analytical strategies may help to minimize the impacts of overt and hidden biases in observational studies, such strategies should be complemented with sound
epidemiological designs, which are the prerequisite to sound analysis. Recently, a series of articles on alternative designs to enhance CER has been published and can be consulted [37]. In addition to the methodological challenges in observational study analyses, database heterogeneity can have a substantial influence on outcomes. A recent study sought to isolate the effect of data source on outcome after holding all other aspects of the study design constant [38]. The results indicate that different databases could generate different results, with potentially strikingly different clinical implications, making clinical studies that use observational databases sensitive to the choice of database. One explanation for the difference is that databases on administrative claims and electronic health records originate from different data-capture processes intended for different purposes, with research not being a primary intention. Moreover, each individual database has its own population with varied patient demographics, underlying disease severity, length of longitudinal data, geographic composition and quality of care – all of which can affect the relationship between drug and outcome. It should also be emphasized that when working with observational databases, the intent is not the estimation of marginal or overall effects for a population of interest, as is the case with RCTs, which also have known issues of external validity. In the case of observational studies, all estimated probabilities and other parameters are conditional and specific to the particular inclusion criteria and other idiosyncrasies found in a particular database. Box 2 presents some relevant questions to consider in the analysis of observational data.

Box 2. Some key questions regarding the assumptions in observational analyses.
■■ Is a prespecified statistical analysis plan used?
■■ What method or model was used to address the confounding bias?
■■ To what extent were the assumptions of the method or model checked and verified?
■■ What factors could have affected the plausibility of those assumptions?
■■ Are sensitivity analyses using those factors performed?
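As a hedged illustration of the propensity score and inverse probability weighting ideas discussed in this section, the sketch below simulates confounding by indication, estimates propensity scores with a logistic regression and forms stabilized inverse probability of treatment weights. The covariates, effect sizes and data are all assumptions made for the example; in practice, weight diagnostics and covariate balance checks would be essential, and no weighting scheme addresses unmeasured confounding.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
age = rng.normal(65, 10, n)
severity = rng.normal(0, 1, n)

# Treatment assignment depends on measured covariates (confounding by indication)
p_treat = 1 / (1 + np.exp(-(-0.5 + 0.02 * (age - 65) + 0.8 * severity)))
treat = rng.binomial(1, p_treat)

# Outcome depends on treatment and the same covariates (true effect = -0.5)
y = 10 - 0.5 * treat + 0.05 * (age - 65) + 1.5 * severity + rng.normal(0, 1, n)

# Propensity scores from a logistic regression on the measured covariates
X = np.column_stack([age, severity])
ps = LogisticRegression(max_iter=1000).fit(X, treat).predict_proba(X)[:, 1]

# Stabilized inverse probability of treatment weights
p_marg = treat.mean()
w = np.where(treat == 1, p_marg / ps, (1 - p_marg) / (1 - ps))

naive = y[treat == 1].mean() - y[treat == 0].mean()
ipw = (np.average(y[treat == 1], weights=w[treat == 1])
       - np.average(y[treat == 0], weights=w[treat == 0]))
print(f"naive difference: {naive:.2f}; IPTW-adjusted difference: {ipw:.2f}")
```

In this simulated setting the weighted contrast moves toward the true effect because all confounders are measured; with the hidden biases discussed above, no such guarantee exists.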

Analytical issues with PROs in CER

PRO data can buttress the evidence generation effort in CER. Since PRO instruments are intended to capture concepts related to how a patient feels or functions in order to help establish the burden of illness and the impact of treatment on some aspect of the patient's health status, the data generated can be useful for providing complementary information to other clinical data [39–41]. Recommendations have been made for incorporating PROs in CER as a guide for researchers, clinicians and policy-makers in general [39,40], and in adult oncology in particular [40]. Emerging changes that may facilitate CER using PRO measures include the implementation
of electronic and personal health records, hospital- and population-based registries and the use of PRO measures in national monitoring initiatives. Additionally, in CER, assessment of the heterogeneity of treatment effects is an important step for making optimal treatment choices for individuals and patient subgroups. In this regard, PRO measures may also be useful. For example, baseline PRO values may provide information that is not readily available through other baseline clinical variables [42]. The collection, analysis and interpretation of PRO data, however, can pose distinctive challenges that require complex assumptions and specialized tools. The issues range from the assumptions underlying the development and validation of PRO instruments to those undergirding the analytical models for formulation and implementation. In some CER initiatives, it may be essential to develop a new PRO instrument for a given therapeutic area of interest. This typically requires a robust and theory-based conceptual framework that involves considerable input from patients. During the process of instrument development and validation, exploratory and confirmatory factor analyses are often conducted in order to examine the factor structure (which items go with which domains), along with standard psychometric methods to assess test–retest reliability, internal consistency reliability, validity (e.g., responsiveness [within-individual change] and sensitivity [between-group difference]) and the clinically important difference of a PRO measure. Assumptions regarding these analyses and methods, although not unique to CER, are worth highlighting here as they are also relevant for CER. The fundamental assumption underlying factor analysis is that one or more underlying factors can account for the patterns of covariation among a number of observed variables. Covariation exists when two variables vary together. Therefore, before conducting a factor analysis, it
is important to analyze the data for patterns of correlation. If no correlation exists, then a factor analysis is needless. If, however, at least moderate levels of correlation among variables are found, factor analysis can help to uncover the underlying patterns that explain these relationships. Given that a factor-analytic model assumes that data are continuous (interval level) and normally distributed, a natural question is whether factor analysis is well suited for the analysis of PROs, where most variables are (strictly speaking) often neither continuous measures nor normally distributed. While more research is welcomed on the consequences of violating these two assumptions, empirical results and simulation studies suggest that factor analysis is relatively robust to reasonable degrees of departure from interval-level data and normality when the sample size is sufficient; therefore, standard factor analysis is generally applicable to ordinal-level data [43]. For an ordinal categorical scale, a logistic modeling approach can also be conducted using polychoric correlations [44,45]. Factor analysis is a linear procedure in which each observed item or variable is regressed on a set of factors, so a linearity assumption is required, as in the case of multiple linear regression. As with structural equation modeling in general, confirmatory factor analysis cannot prove causation. Rather, a major purpose of confirmatory factor analysis is to determine whether the causal inferences of a researcher are consistent with the data. A host of model fit indices exists in order to assess consistency between the model and the data. Among the most prominent are the following: the goodness-of-fit index, the comparative fit index, the normed fit index, the non-normed fit index and the root mean square error of approximation (RMSEA). For the first four of these indices, values above 0.90 generally indicate an acceptable fit [46]. Of these, the comparative fit index generally has the best properties and is the one that is most commonly reported. For the RMSEA, values below 0.10 can be considered desirable, and 90% CIs for the true RMSEA are often obtained [47]. If the confirmatory factor model does not fit the data, then revisions are needed because this means that one or more of the model assumptions are incorrect or need refinement.
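Returning to the first step noted above – checking whether the items are correlated enough to justify factoring – the following sketch computes Bartlett's test of sphericity (a test of whether the item correlation matrix differs from the identity matrix) on simulated ordinal responses. The items and sample size are hypothetical; in practice one would also inspect the correlation matrix directly and consider complementary measures such as the Kaiser–Meyer–Olkin statistic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 300, 6

# Simulated responses to six correlated ordinal items (1-5) sharing one latent factor
latent = rng.normal(0, 1, n)
items = np.clip(np.rint(3 + latent[:, None] + rng.normal(0, 1, (n, p))), 1, 5)

R = np.corrcoef(items, rowvar=False)                     # item correlation matrix
chi2 = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
df = p * (p - 1) / 2
p_value = 1 - stats.chi2.cdf(chi2, df)

print(f"Bartlett's test of sphericity: chi2 = {chi2:.1f}, df = {df:.0f}, p = {p_value:.4f}")
# A small p-value suggests the items are correlated enough for factoring to be sensible.
```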

Even if the confirmatory factor analysis model is consistent with the data, this still does not prove causation. Instead, it shows that the assumptions made are not contradicted and may be valid,
without ruling out the notion that other models and assumptions may also fit the data. Test–retest reliability or reproducibility of a PRO measure, which usually involves two visits and is assessed with an intraclass correlation coefficient (the recommended statistic), assumes no relationship between the differences in values across the two visits (which represent measurement error) and the mean of the values across the two visits (which represents the true value). If this assumption is violated, the data can be transformed accordingly before testing with an intraclass correlation coefficient. In the case of more than two visits, the intraclass correlation is assumed to be the same between any pair of time measurements. When a clinically important difference is based on a regression model with the PRO measure as the outcome and with the anchor (an external measure that is interpretable and related to the PRO measure) as the predictor, the anchor as a continuous predictor may be taken to have a linear relationship with the PRO response in order to simplify the interpretation and to enrich the sensitivity of the relationship. Such an assumption, however, should be verified by taking the anchor as a categorical predictor in a sensitivity analysis. Doing so will allow the two results to be compared in order to learn whether the linearity assumption matters.
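A brief, hedged sketch of the reproducibility assumption described above is given below: with hypothetical two-visit data, it checks whether within-patient differences are related to within-patient means and then computes one common two-way random-effects form of the intraclass correlation coefficient (ICC(2,1) in the Shrout–Fleiss notation). The data and the choice of ICC form are assumptions for illustration and would need to match the design of the actual study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_subj, k = 80, 2                                  # 80 patients, 2 visits

true_score = rng.normal(50, 10, n_subj)
scores = true_score[:, None] + rng.normal(0, 5, (n_subj, k))   # visit 1 and visit 2

# Assumption check: measurement error (visit differences) unrelated to true value (visit means)
diff = scores[:, 1] - scores[:, 0]
mean = scores.mean(axis=1)
r, p = stats.pearsonr(mean, diff)
print(f"difference-vs-mean correlation: r = {r:.2f} (p = {p:.2f})")

# ICC(2,1): two-way random effects, absolute agreement, single measurement
grand = scores.mean()
sst = np.sum((scores - grand) ** 2)
ssb = k * np.sum((scores.mean(axis=1) - grand) ** 2)           # between subjects
ssc = n_subj * np.sum((scores.mean(axis=0) - grand) ** 2)      # between visits
sse = sst - ssb - ssc
msr, msc = ssb / (n_subj - 1), ssc / (k - 1)
mse = sse / ((n_subj - 1) * (k - 1))
icc = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n_subj)
print(f"ICC(2,1) = {icc:.2f}")
```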

Item response theory (IRT) using computerized adaptive testing is another approach to PRO standardization in CER [48] that allows measurement error to vary (and not be constant, as is the case with classical test theory) according to the true value of health status. IRT models enable the reduction and improvement of items according to a single (unidimensional) concept. An exemplary illustration of IRT is the Patient-Reported Outcomes Measurement Information System (PROMIS), where efficient, flexible and precise item banks and their short forms are developed for use in clinical research and practice [49]. The benefit of IRT, in turn, relies on three major assumptions that need to be met: unidimensionality, local independence and model fit. The assumption of unidimensionality requires that a scale consists of items that tap into only one dimension, which can be examined through factor analysis [46]. The assumption of local independence (which is essentially equivalent to that of unidimensionality) means that, once we remove the influence of the underlying attribute or factor being measured (i.e., the first factor in a factor analysis), there should be no association
among the responses from any pair of items. In other words, for a subsample of individuals who have the same level on the attribute (which is one way to remove or control for the effect of the attribute), there should be no correlation among the items. Local independence can be examined through the correlation matrix of residuals for the set of items (the 'residuals' being the difference between the observed response and the modeled or predicted response), with a 'large enough' correlation (e.g., 0.20) between residuals being suggestive of local dependence [50]. The assumption of correct model fit specification is not unique to IRT models. Assessing fit in IRT models is performed at both the item level and the person level. The fit between the model and the data can be assessed in order to check for model misspecification. An examination of residuals is important, as the objective of IRT modeling is to estimate a set of item and person parameters that reproduce the observed item responses as closely as possible. Graphical and empirical approaches are available for evaluating item fit, person fit and overall model fit [51]. In PRO data analysis, another major issue is the handling of multiple end points and the accompanying problem of missing values. In CER, which involves comparing a wide range of treatment options, the use of multiple end points can complicate the interpretation of results. The available statistical procedures for multiple testing that are routinely used depend on several factors, including methodological assumptions and prespecified decision rules [52]. The related problem of missing data, while not unique to PRO measures, is particularly germane in this case, because missing values may occur both at the questionnaire level (all items are missing) and at the individual item level (only some of the items on a questionnaire are missing). The usual approach to handling missingness is a function of certain assumptions about whether the data are missing at random or not. The success of adding PRO measures to the electronic medical record, as well as to clinical studies, for CER depends on capturing an unbiased stream of responses from respondents. When data are missing at random, techniques such as multiple imputation and mixed-effects models (e.g., repeated measures models and random coefficient models) have been proposed. However, when this assumption is not satisfied – which is often the case – the results can be biased, and techniques that assume data are
missing not at random should also be considered [53,54]. However, these techniques, which include pattern mixture models, selection models and shared parameter models, rely on assumptions (e.g., knowing the missing data mechanism and therefore the correct distribution of all values, including the missing ones) that cannot be verified because the data are simply not available. In a case study of a trial of adjuvant therapy for breast cancer, the assumption that the data were missing at random was adequate to obtain unbiased estimates [55]. On the other hand, in a case study of a trial in patients with advanced non-small-cell lung cancer, the estimation of the change in quality of life scores over time was sensitive to the assumptions underlying the model, and the assumption that data were missing at random was not justified [55]. Box 3 presents several questions to consider when conducting or appraising an investigation using PRO measures.

Box 3. Some key questions regarding the assumptions in patient-reported outcome analyses.
■■ Is there an appreciable correlation among items to justify an exploratory or confirmatory factor analysis?
■■ Is the sample size sufficient to provide a meaningful factor analysis?
■■ Is an intraclass correlation coefficient used to assess test–retest reliability?
■■ If an intraclass correlation coefficient was used to assess test–retest reliability, is there evidence that differences in the test–retest scores were constant over the range of the attribute?
■■ If an item response theory analysis was performed, are unidimensionality and local independence assessed?
■■ If an item response theory analysis was performed, is the sample size large enough?
■■ How was a clinically important difference determined?
■■ What assumptions were made about a clinically important difference (e.g., a linear functional form if a regression model was used)?
■■ If the data are longitudinal, did the model account for the correlated responses over time?
■■ What assumptions did the model make regarding missing data?
■■ If data are missing, were sensitivity analyses performed?
■■ Is multiple testing of domains and time points addressed?
■■ What factors could have affected the plausibility of those assumptions?
■■ Are sensitivity analyses using those factors performed?
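To illustrate the kind of missing-data sensitivity analysis these questions point to, the sketch below applies a simple delta-adjustment ('tipping-point') approach to a simulated PRO change score: unobserved values in one arm are imputed at the completer mean shifted by progressively worse offsets, and the treatment comparison is re-examined at each shift. The data, offsets and single-imputation shortcut are all illustrative assumptions; a principled analysis would use multiple imputation or likelihood-based models under explicitly stated mechanisms so that imputation uncertainty is properly reflected.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 200

# Simulated change-from-baseline PRO scores; higher = better
change_trt = rng.normal(6.0, 8.0, n)
change_ctl = rng.normal(3.0, 8.0, n)

# Suppose 25% of treated patients drop out (their changes are unobserved)
observed_trt = change_trt[rng.random(n) > 0.25]

for delta in [0, -2, -4, -6, -8]:
    # Impute missing treated values at the completer mean shifted by delta (worse outcomes)
    n_miss = n - len(observed_trt)
    imputed = np.concatenate([observed_trt,
                              np.full(n_miss, observed_trt.mean() + delta)])
    t, p = stats.ttest_ind(imputed, change_ctl)
    print(f"delta = {delta:+d}: mean diff = {imputed.mean() - change_ctl.mean():5.2f}, p = {p:.3f}")
```

The size of the shift needed to overturn the conclusion, judged against what is clinically plausible, is what makes such a tipping-point display interpretable.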

Best practices

As indicated in the preceding sections, most analytical techniques in use or with potential use in CER are inherently associated with important assumptions, some of which may not be readily verifiable based on the available data. Understandably, there is no general framework to guide practitioners in terms of the choice of optimal procedures that can reliably be used for
each and every situation. However, it is critically important to follow certain best practices in order to enhance the credibility of analytical results and to minimize the dissemination of biased results that could have untoward consequences on health outcomes. Below, we outline a few best practices and relevant guidelines that may be of value to both data analysts and other stakeholders who use analytical results in CER.

■■ Protocol development

An important best practice for the analysis of any data in CER is the development of a protocol that specifies upfront, among other elements, the research hypothesis of interest, the outcome variable, sample size considerations and the analytical procedures. In particular, the protocol should state the rationale behind the choice of a statistical model, the underlying assumptions of the model and the steps that need to be taken in order to validate the assumptions. It is critical to clearly state the process used to identify and define confounders, and how they will be handled in the planned statistical model. For a well-defined hypothesis, the clinically meaningful effect (the clinically important difference, if known) should be stated in order to arrive at a sample size that is large enough for sufficient statistical power. To the extent possible, confounding factors should be identified a priori and should not be based on post hoc data analyses. Furthermore, there should be a discussion of the source of data and its adequacy to address the research hypotheses of interest.
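For instance, if the clinically important difference can be expressed as a standardized effect size, the sample size supporting it can be documented in the protocol along the following lines (a minimal sketch using statsmodels' power routines for a two-group comparison of means; the effect size, power and significance level shown are illustrative assumptions, not recommendations):

```python
from statsmodels.stats.power import TTestIndPower

# Standardized effect size = clinically important difference / SD (assumed 0.3 here)
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.3, alpha=0.05, power=0.80,
                                   ratio=1.0, alternative='two-sided')
print(f"required sample size per group: {n_per_group:.0f}")
```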

■■ Planned analyses

As a best practice, the data should be analyzed and reported according to the prespecified plan. Once the planned analysis is executed, all model assumptions should be considered and, when appropriate, justified and validated. In some cases, it may be necessary to deviate from the planned analyses, thereby interjecting post hoc analyses and inferences. When there is a need for deviation from the planned analysis, the rationale for doing so and the consequence for the interpretation of results should be stated. It is also good practice to conduct simple descriptive and stratified analyses prior to undertaking more complex modeling.

■■ Sensitivity analyses

Given the limitations of existing approaches for use in CER, the analyst should make an extra
effort to enhance the credibility of the findings through prudent sensitivity analyses. These analyses are intended to confirm (or not confirm) the reproducibility of results under alternative model assumptions and the use of competing statistical approaches. Sensitivity analyses may also include performing the analyses in different databases (for observational studies) and, when practicable, performing Monte Carlo simulations. When there are missing values, the sensitivity analyses should consider alternative techniques for missing data, under different assumptions about the missing data mechanism.
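A hedged sketch of such a Monte Carlo exercise is shown below: it repeatedly simulates data with a known treatment effect and an unmeasured confounder, and contrasts the bias of an unadjusted comparison with that of an analysis adjusting for the measured covariate only. All parameter values are arbitrary assumptions chosen for illustration; the point is to quantify how sensitive conclusions are to plausible departures from the analytic assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
true_effect, n, n_sims = -0.5, 2000, 500
unadj, adj = [], []

for _ in range(n_sims):
    x = rng.normal(0, 1, n)              # measured confounder
    u = rng.normal(0, 1, n)              # unmeasured confounder
    p = 1 / (1 + np.exp(-(0.7 * x + 0.7 * u)))
    t = rng.binomial(1, p)
    y = true_effect * t + 1.0 * x + 1.0 * u + rng.normal(0, 1, n)

    unadj.append(y[t == 1].mean() - y[t == 0].mean())
    # Adjust for the measured confounder only, via ordinary least squares
    X = np.column_stack([np.ones(n), t, x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    adj.append(beta[1])

print(f"true effect {true_effect:+.2f}; "
      f"mean unadjusted estimate {np.mean(unadj):+.2f}; "
      f"mean covariate-adjusted estimate {np.mean(adj):+.2f}")
```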

■■ Reporting of results

In view of the unsettled issues with the statistical approaches discussed in this article, it is important that results are reported transparently and interpreted with fair balance. In particular, the strengths and weaknesses of the analytical approaches, as well as any departure from the planned analysis, should be provided. For exploratory analyses, the distinction between association and causation should be elucidated. In general, results should be presented in light of all of the available information, with particular reference to their biological plausibility and consistency with the current understanding of the disease process. If the results of sensitivity analyses are not consistent with those of the primary analyses, the implications of the discrepancy should be reported.

■■ Data quality assurance

In any data analysis initiative, the soundness of the results is a function not only of the validity of the methodological assumptions, but also of the quality of the data [3]. Data quality is especially germane to the analysis of observational studies, since such data are complex and require considerable effort to process. Assurance of data quality is paramount and should be an integral part of the analysis and reporting of CER results.

■■ Use of guidelines

Analysts and researchers should take advantage of important guidelines that are now available to assist with the design, analysis and reporting of data for potential use in CER [56]. For ITCs, relevant guidance documents include the two reports issued by the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) taskforce on good research practices [57,58], the
series of seven tutorial papers on evidence synthesis methods for decision-making [59] and the nine articles that offer a perspective on the current state of the science, with important new developments in original and review articles from leaders in the field [60]. For observational data analysis, important recommendations are included in several guidance documents from ISPOR [20,61–63] and elsewhere [24]. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement provides a guideline aimed at improving the reporting of observational intervention studies [64]. For PRO measures, several guidelines are available pertaining to clinical research [65,66,101–103] and clinical practice from the International Society for Quality of Life Research (ISOQOL) [39,67], and from the perspective of adult oncology [40]. Best practices for subgroup analyses and reporting have also been proposed [68,69]. While the preceding is but a partial list, there will be continued activities by institutions, such as the Patient-Centered Outcomes Research Institute (PCORI) and the Agency for Healthcare Research and Quality (AHRQ), to issue documents on best practices relating to issues arising in the analysis of data for purposes of CER. For instance, an AHRQ-funded 2012 symposium on research methods for CER and patient-centered outcomes research focused on original research, methodological insights or advances arising from the conduct of CER and how CER can better support healthcare decision-making. Based on this symposium, a special issue of the Journal of Clinical Epidemiology featured specialized topics on CER related to its implementation, its challenges in mental health, its role within healthcare delivery system changes, its analytic issues and its design issues [37]. Researchers should, therefore, remain vigilant regarding emerging ideas and the ways of incorporating them as they become available.

Conclusion

In view of the anticipated impact of CER on health benefits and harms in the USA in the coming years, the issue of the reliability of methodological tools that will be in use in order to inform decision-making will continue to spark important debate within academic and nonacademic circles. In this article, we highlighted some of the underlying assumptions and challenges with commonly used techniques in CER, and suggested steps that may be taken in order
to help mitigate the possibility of disseminating biased research results. However, the narrow scope of this review article did not permit a more exhaustive elucidation of the pitfalls associated with the commonly used procedures in CER. Indeed, more work is required in order to further assess and expound the methodological assumptions in existing and emerging analytical approaches in CER, with emphasis on their diagnosis, validation and remediation. More importantly, such work should also include a systematic review of the available literature, which has not been performed in this article, in order to further illuminate and extend upon these issues, as well as to provide a uniform assessment. Although methodological issues are important in any research involving data analysis, the matter is particularly important in CER, given the breadth of the problems targeted and the far-reaching consequences of the research results. Decision-making in CER requires evidence syntheses from RCTs. It also requires evidence from quasi-experimental designs (where investigators assign subjects to interventions nonrandomly) and retrospective databases, which require the application of nonstandard and complex statistical procedures. While the efforts of organizations such as PCORI and AHRQ are important, more collaboration is still needed among the various stakeholders (including academia, healthcare providers, regulatory agencies and other relevant institutions) in order to develop standards and formulate policies regarding the proper use of existing tools and the implementation of new ones. The primary purpose of CER is to help healthcare decision-makers in making informed clinical and health policy decisions. In order for CER to achieve its main purpose, the methods and infrastructure for CER need to be improved and will require sustained attention in terms of the following issues: meaningful involvement of patients, consumers, clinicians, payers and policy-makers in key phases of CER study design and implementation; development of methodological 'best practices' for the design of CER studies that reflect decision-maker needs and balance internal validity with relevance, feasibility and timeliness; and improvements in research infrastructure in order to enhance the validity and efficiency with which CER studies are implemented [3]. In addition, part of that enhanced validity is the attention to and evaluation
of the methodological assumptions inherent in the analytical methods and models used.

Future perspective

In the coming decades, CER will present considerable challenges and opportunities to healthcare providers, policy-makers, researchers and other stakeholders involved in methodological research. The framework for addressing the methodological issues will require novel paradigms of data analysis that will include enhancements of existing techniques and the development of new ones. The tools that are being explored to handle 'big data' are expected to play critical roles in CER, as the reliance on RCTs increasingly diminishes in favor of nonstandard data sources for evidence generation. In addition, advances in precision or personalized medicine will help broaden the prospect for addressing issues of heterogeneity of treatment effect in a new framework. In the
near term, however, there will still be a continued focus on the issues addressed in this article, including the development of tools that reliably assess consistency and similarity in ITCs, control for bias in observational studies and incorporate PROs appropriately in CER decision-making.

Acknowledgements
The authors would like to thank the anonymous reviewers for helpful and insightful comments.

Financial & competing interests disclosure
D Alemayehu and JC Cappelleri are employees and shareholders of Pfizer Inc. The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed. No writing assistance was utilized in the production of this manuscript.

Executive summary
Background
■■ The broad scope of comparative effectiveness research (CER) requires complex and nonstandard procedures that presuppose the realization of important, and often untestable, assumptions.
Indirect comparisons
■■ For the proper implementation of indirect treatment comparison methods, it is necessary to assess whether the assumptions of homogeneity, similarity and consistency are satisfied.
Observational studies
■■ When data from observational studies are incorporated for evidence generation, the methods used to address confounding bias should be carefully evaluated.
Patient-reported outcomes
■■ Effective use of patient-reported outcomes in CER decision-making presumes the application of sound statistical techniques for instrument development and validation, as well as for subsequent analysis of treatment effects.
Conclusion
■■ In view of the unsettled methodological issues in the generation of evidence for CER, efforts should be made to leverage existing best practices and guidance documents, and also to develop new tools and guidelines in order to assist data analysts and other CER stakeholders.

References
Papers of special note have been highlighted as: ■ of interest; ■■ of considerable interest.

1 Institute of Medicine (IOM). Initial National Priorities for Comparative Effectiveness Research. The National Academies Press, Washington, DC, USA (2009).
■ Provides both the background for the evolving comparative effectiveness field and clarification of its goals and approaches.
2 Luce BR, Kramer JM, Goodman SN et al. Rethinking randomized trials for comparative effectiveness research: the need for transformational change. Ann. Intern. Med. 151, 206–209 (2009).
■ Addresses several fundamental limitations of traditional randomized clinical trials for meeting the objectives of comparative effectiveness research (CER) and offers three potentially transformational approaches to enhance their operational efficiency, analytical efficiency and generalizability for CER.
3 Tunis SR, Benner J, McClellan M. Comparative effectiveness research: policy context, methods development and research infrastructure. Stat. Med. 29, 1963–1976 (2010).
■ Provides a description and context for the issues that will improve the methods and infrastructure for CER.
4 Bucher HC, Guyatt GH, Griffith LE, Walter SD. The results of direct and indirect treatment comparisons in meta-analysis of randomized controlled trials. J. Clin. Epidemiol. 50, 683–691 (1997).
5 Song F, Altman DG, Glenny A, Deeks JJ. Validity of indirect comparison for estimating efficacy of competing interventions: empirical evidence from published meta-analyses. BMJ 326, 472 (2003).
6 Caldwell DM, Ades AE, Higgins JPT. Simultaneous comparison of multiple treatments: combining direct and indirect evidence. BMJ 331, 897–900 (2005).
7 Lu G, Ades AE. Combination of direct and indirect evidence in mixed treatment comparisons. Stat. Med. 24, 3105–3124 (2004).
8 Lumley T. Network meta-analysis for indirect treatment comparisons. Stat. Med. 21, 2313–2324 (2002).
9 Gisbert JP, Gonzáles L, Calvet X, Roqué M, Gabriel R, Pajares JM. Helicobacter pylori eradication: proton pump inhibitor vs. ranitidine bismuth citrate plus two antibiotics for 1 week – a meta-analysis of efficacy. Aliment. Pharmacol. Ther. 14, 1141–1150 (2000).
10 Cipriani A, Furukawa TA, Salanti G et al. Comparative efficacy and acceptability of 12 new-generation antidepressants: a multiple-treatments meta-analysis. Lancet 373, 746–758 (2009).
11 Jansen JP, Crawford B, Bergman G, Stam W. Bayesian meta-analysis of multiple treatment comparisons: an introduction to mixed treatment comparisons. Value Health 11, 956–964 (2008).
12 Alemayehu D. Assessing exchangeability in indirect and mixed treatment comparisons. Comp. Effective. Res. 1, 51–55 (2011).
13 Berlin JA, Santanna J, Schmid CH, Szczech LA, Feldman KA. Individual patient- versus group-level data meta-regressions for the investigation of treatment effect modifiers: ecological bias rears its ugly head. Stat. Med. 21, 371–387 (2002).
14 Caro JJ, Möller J, Getsios D. Discrete event simulation: the preferred technique for health economic evaluations? Value Health 13, 1056–1060 (2010).
15 Caro JJ, Ishak KJ. No head-to-head trial? Simulate the missing arms. Pharmacoeconomics 28, 957–967 (2010).
16 Salanti G, Kavvoura FK, Ioannidis JPA. Exploring the geometry of treatment networks. Ann. Intern. Med. 148, 544–553 (2008).
17 Lu G, Ades AE. Assessing evidence inconsistency in mixed treatment comparisons. J. Am. Stat. Assoc. 101, 447–459 (2006).
18 Dias S, Welton NJ, Caldwell DM, Ades AE. Checking consistency in mixed treatment comparison meta-analysis. Stat. Med. 29, 932–944 (2010).
19 Dias S, Welton NJ, Sutton AJ, Caldwell DM, Lu G, Ades AE. Evidence synthesis for decision making 4: inconsistency in networks of evidence based on randomized controlled trials. Med. Decis. Making 33, 641–656 (2013).
20 Johnson ML, Crown W, Martin BC, Dormuth CR, Siebert U. Good research practices for comparative effectiveness research: analytic methods to improve causal inference from nonrandomized studies of treatment effects using secondary data sources: the ISPOR Good Research Practices for Retrospective Database Analysis Task Force Report – Part III. Value Health 12, 1062–1073 (2009).
■ Addresses methods to improve causal inferences of treatment effects for nonrandomized studies.
21 Groenwold RHH, Van Deursen AMM, Hoes AW, Hak E. Poor quality of reporting confounding bias in observational intervention studies: a systematic review. Ann. Epidemiol. 18, 746–751 (2008).
22 Montalescot G, Ellis SG, de Belder MA et al. Enoxaparin in primary and facilitated percutaneous coronary intervention: a formal prospective nonrandomized substudy of the FINESSE trial (Facilitated INtervention with Enhanced Reperfusion Speed to Stop Events). J. Am. Coll. Cardiol. Intv. 3, 203–212 (2010).
23 Psaty BM, Siscovick DS. Minimizing bias due to confounding by indication in comparative effectiveness research: the importance of restriction. JAMA 304, 897–898 (2010).
24 Alemayehu D, Cappelleri JC. Revisiting issues, drawbacks and opportunities with observational studies in comparative effectiveness research. J. Eval. Clin. Pract. 19, 579–583 (2013).
■ Revisits the major issues associated with observational studies from secondary data sources.
25 Rothman KJ, Greenland S, Lash TL. Modern Epidemiology (3rd Edition). Lippincott Williams & Wilkins, PA, USA (2008).
■ Provides a comprehensive treatise on the principles and methods of contemporary epidemiologic research.
26 Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55 (1983).
27 D'Agostino RB Jr. Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Stat. Med. 17, 2265–2281 (1998).
28 Austin PC. The performance of different propensity score methods for estimating marginal hazard ratios. Stat. Med. 32, 2837–2849 (2013).
29 Brooks JM, Ohsfeldt RL. Squeezing the balloon: propensity scores and unmeasured covariate balance. Health Serv. Res. 48, 1487–1507 (2013).
30 Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivar. Behav. Res. 46, 399–420 (2011).
31 Zanutto EL. A comparison of propensity score and linear regression analysis of complex survey data. J. Data Sci. 4, 67–91 (2006).
32 Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. J. Am. Stat. Assoc. 91, 444–455 (1996).
33 Martens EP, Pestman WR, de Boer A, Belitser SV, Klungel OH. Instrumental variables: application and limitations. Epidemiology 17, 260–267 (2006).
34 Murray M. Avoiding invalid instruments and coping with weak instruments. J. Econ. Perspect. 20, 111–132 (2007).
35 Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 11, 550–560 (2000).
36 Kline RB. Principles and Practice of Structural Equation Modeling (3rd Edition). Guilford Press, NY, USA (2010).
37 Schneeweiss SS, Seeger JZD, Jackson JW, Smith SR. Methods for comparative effectiveness research/patient-centered outcomes research: from efficacy to effectiveness. J. Clin. Epidemiol. 66, S1–S4 (2013).
■ Highlights specialized topics within a journal issue on CER related to its implementation, its challenges in mental health, its role within healthcare delivery system changes, its analytic issues and its design issues.
38 Madigan D, Ryan PB, Schuemie M et al. Evaluating the impact of database heterogeneity on observational study results. Am. J. Epidemiol. 178(4), 645–651 (2013).
39 Ahmed S, Berzon RA, Revicki DA et al. The use of patient-reported outcomes (PRO) within comparative effectiveness research: implications for clinical practice and health care policy. Med. Care 50, 1060–1070 (2012).
■ Discusses the value of patient-reported outcomes (PROs) within CER, the types of measures that are likely to be useful in the CER context, PRO instrument selection and key challenges that are associated with using PROs in CER.
40 Basch E, Abernethy AP, Mullins D et al. Recommendations for incorporating patient-reported outcomes into clinical comparative effectiveness research in adult oncology. J. Clin. Oncol. 30, 4249–4255 (2012).
■ Offers guidance for improving the availability, consistency and usefulness of information regarding the patient experience in clinical CER in adult oncology.
41 Alemayehu D, Sanchez R, Cappelleri JC. Considerations on the use of patient-reported outcomes in comparative effectiveness research. J. Manag. Care Pharm. 17, S27–S33 (2011).
■ Discusses the role of PRO measures in CER, reviews the challenges associated with the inclusion of PROs in CER initiatives, provides a framework for their effective utilization and proposes several areas for future research.
42 Horn SD, Gassaway J. Incorporating clinical heterogeneity and patient reported outcomes for comparative effectiveness research. Med. Care 48, S17–S22 (2010).
43 Fayers PM, Machin D. Quality of Life: the Assessment, Analysis and Interpretation of Patient-reported Outcomes (2nd Edition). John Wiley & Sons Ltd, UK (2007).
■ Provides a comprehensive and detailed exposition on a wide range of topics that are relevant to PRO measures.
44 Muthén BO, Kaplan D. A comparison of some methodologies for the factor analysis of non-normal Likert variables: a note on the size of the model. Br. J. Math. Stat. Psychol. 45, 19–30 (1992).
45 Flora DB, Curran PJ. An empirical evaluation of alternative methods of estimation for confirmatory factor analysis with ordinal data. Psychol. Methods 9, 466–491 (2004).
46 Hatcher L. A Step-by-Step Approach to Using the SAS® System for Factor Analysis and Structural Equation Modeling. SAS Institute, Inc., NC, USA (1994).
47 Steiger JH. EzPATH: Causal Modeling. SYSTAT, Inc., IL, USA (1999).
48 Chakravarty EF, Bjorner JB, Fries JF. Improving patient reported outcomes using item response theory and computerized adaptive testing. J. Rheumatol. 34, 1426–1443 (2007).
49 Cella D, Riley W, Stone A et al. The Patient-Reported Outcomes Measurement Information System (PROMIS) developed and tested its first wave of adult self-reported health outcome item banks: 2005–2008. J. Clin. Epidemiol. 63, 1179–1194 (2010).
50 Yen WM. Scaling performance measures: strategies for managing local item dependence. J. Educ. Meas. 30, 187–213 (1993).
51 DeMars C. Item Response Theory. Oxford University Press, NY, USA (2010).
52 Dmitrienko A, Tamhane A, Bretz F (Eds). Multiple Testing Problems in Pharmaceutical Statistics. Chapman and Hall/CRC, FL, USA (2009).
53 Little RJA, Rubin DB. Statistical Analysis with Missing Data (2nd Edition). John Wiley & Sons, NY, USA (2002).
54 Fairclough DL. Design and Analysis of Quality of Life Studies in Clinical Trials (2nd Edition). Chapman & Hall/CRC, FL, USA (2010).
■ Provides a detailed account of design considerations and analytic methods in clinical trials, with an in-depth exposition on how to address longitudinal analysis, missing data and multiple testing, which may also have relevance to other types of studies.
55 Fairclough DL, Peterson HF, Cella D, Bonomi P. Comparison of several model-based methods for analyzing incomplete quality of life data in cancer clinical trials. Stat. Med. 17, 781–796 (1998).
56 Methods Guide for Comparative Effectiveness Reviews. Agency for Healthcare Research and Quality, MD, USA (2010).
■ The articles in this series together form the current 'Methods Guide for Comparative Effectiveness Reviews' of the Effective Health Care Program established by the Agency for Healthcare Research and Quality (AHRQ).
57 Jansen JP, Fleurence R, Devine B et al. Interpreting indirect treatment comparisons and network meta-analysis for health-care decision making: report of the ISPOR Task Force on Indirect Treatment Comparisons Good Research Practices – Part 1. Value Health 14, 417–428 (2011).
■ Provides guidance on the interpretation of indirect treatment comparisons in network meta-analysis in order to assist policy-makers and healthcare professionals in using its findings for decision-making.
58 Hoaglin DC, Hawkins N, Jansen JP et al. Conducting indirect-treatment-comparison and network-meta-analysis studies: report of the ISPOR Task Force on Indirect Treatment Comparisons Good Research Practices – Part 2. Value Health 14, 429–437 (2011).
■ Provides guidance on the technical aspects of conducting network meta-analyses.
59 Dias S, Welton NJ, Sutton AJ, Ades AE. Evidence synthesis for decision making 1: introduction. Med. Decis. Making 33, 597–606 (2013).
■ Highlights seven tutorial papers within a special issue on evidence synthesis methods – specifically on indirect treatment comparisons (network meta-analysis) and cost–effectiveness analysis – for decision-making based on the Technical Support Documents in Evidence Synthesis prepared for the NICE Decision Support Unit.
60 Salanti G, Schmid CH. Research Synthesis Methods special issue on network meta-analysis: introduction from the editors. Res. Synthesis Methods 3, 69–70 (2012).
■ Summarizes several articles, within a special issue on network analysis, that offer perspectives on the current state of the science with important new developments in original and review articles from leaders in the field.
61 Berger ML, Mamdani M, Atkins D, Johnson ML. Good research practices for comparative effectiveness research: defining, reporting and interpreting nonrandomized studies of treatment effects using secondary data sources: the ISPOR Good Research Practices for Retrospective Database Analysis Task Force Report – Part I. Value Health 12, 1044–1052 (2009).
■ Provides guidance on state-of-the-art approaches for framing research questions and reporting results in order to address the challenges of conducting valid retrospective epidemiologic and health services research studies.
62 Cox E, Martin BC, Van Staa T, Garbe E, Siebert U, Johnson ML. Good research practices for comparative effectiveness research: approaches to mitigate bias and confounding in the design of nonrandomized studies of treatment effects using secondary data sources: the International Society for Pharmacoeconomics and Outcomes Research Good Research Practices for Retrospective Database Analysis Task Force Report – Part II. Value Health 12, 1053–1061 (2009).
■ Provides methodological guidance in order to address the challenges associated with causal inferences from observational studies.
63 Berger ML, Dreyer N, Anderson F, Towse A, Sedrakyan A, Normand SL. Prospective observational studies to assess comparative effectiveness: the ISPOR Good Research Practices Task Force Report. Value Health 15, 217–230 (2012).
■ Highlights key issues on how to decide when to carry out a prospective observational study in light of its advantages and disadvantages with respect to alternatives, and addresses the challenges and approaches to the appropriate design, analysis and execution of prospective observational studies in order to make them more valuable and relevant to healthcare decision-makers.
64 Vandenbroucke JP, von Elm E, Altman DG et al.; STROBE Initiative. Strengthening the reporting of observational studies in epidemiology (STROBE): explanation and elaboration. Ann. Intern. Med. 147, W-163–W-194 (2007).
65 Acquadro C, Berzon R, Dubois D et al. Incorporating the patient's perspective into drug development and communication: an ad hoc task force report of the Patient-Reported Outcomes (PRO) Harmonization Group Meeting at the Food and Drug Administration, February 16, 2001. Value Health 6, 522–531 (2003).
66 Calvert M, Blazeby J, Altman DG, Revicki DA, Moher D, Brundage MD; CONSORT PRO Group. Reporting of patient-reported outcomes in randomized trials: the CONSORT PRO Extension. JAMA 309, 814–822 (2013).
67 Snyder CF, Aaronson NK, Choucair AK et al. Implementing patient-reported outcomes assessment in clinical practice: a review of the options and considerations. Qual. Life Res. 21, 1305–1314 (2012).
■ Summarizes the key issues from the 'User's Guide for Implementing Patient-Reported Outcomes Assessment in Clinical Practice' developed by the International Society for Quality of Life Research (ISOQOL).
68 Wang R, Lagakos SW, Ware JH, Hunter DJ, Drazen JM. Statistics in medicine – reporting of subgroup analyses in clinical trials. N. Engl. J. Med. 357, 2189–2194 (2007).
69 Oxman AD, Guyatt GH. A consumer's guide to subgroup analyses. Ann. Intern. Med. 116, 78–84 (1992).

Websites
101 US FDA. Guidance for industry: patient-reported outcome measures: use in medical product development to support labeling claims (2009). www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM193282.pdf (Accessed 31 August 2013)
■ Describes how the US FDA reviews and evaluates existing, modified or newly created PRO instruments that are used to support claims in approved medical product labeling.
102 EMA, Committee for Medicinal Products for Human Use. Reflection paper on the regulatory guidance for the use of health-related quality of life (HRQL) measures in the evaluation of medicinal products (2005). www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2009/09/WC500003637.pdf (Accessed 31 August 2013)
103 EMA, Committee for Medicinal Products for Human Use. Qualification of novel methodologies for drug development: guidance to applicants (2009). www.ema.europa.eu/docs/en_GB/document_library/Regulatory_and_procedural_guideline/2009/10/WC500004201.pdf (Accessed 31 August 2013)
