Confirmatory Factor Analysis of the Patient Version of the Working Alliance Inventory–Short Form Revised

Assessment 2015, Vol. 22(5) 581–593. © The Author(s) 2014. Reprints and permissions: sagepub.com/journalsPermissions.nav. DOI: 10.1177/1073191114552472. asm.sagepub.com

Fredrik Falkenström1,2, Robert L. Hatcher3, and Rolf Holmqvist1

Abstract

The working alliance concerns the quality of collaboration between patient and therapist in psychotherapy. One of the most widely used scales for measuring the working alliance is the Working Alliance Inventory (WAI). For the patient-rated version, the short form developed by Hatcher and Gillaspy (WAI-SR) has shown the best psychometric properties. In two confirmatory factor analyses of the WAI-SR, approximate fit indices were within commonly accepted norms, but the likelihood ratio chi-square test showed significant ill-fit. The present study used Bayesian structural equations modeling with zero mean and small variance priors to test the factor structure of the WAI-SR in three different samples (one American and two Swedish; N = 235, 634, and 234). Results indicated that maximum likelihood confirmatory factor analysis showed poor model fit because of the assumption of exactly zero residual correlations. When residual correlations were estimated using small variance priors, model fit was excellent. A two-factor model had the best psychometric properties. Strong measurement invariance was shown between the two Swedish samples and weak factorial invariance between the Swedish and American samples. The most important limitation concerns the limited knowledge on when the assumption of residual correlations being small enough to be considered trivial is violated.

Keywords

Working Alliance Inventory, confirmatory factor analysis, Bayesian structural equations modeling, measurement invariance

The working alliance, originally a psychoanalytic concept that was adopted by psychotherapy researchers in the 1970s (Bordin, 1979; Horvath, 2000), concerns the role of collaboration between therapist and patient in the production of psychotherapy outcomes. Defined by Bordin (1979) as agreement between therapist and patient on what goals treatment should aim for, collaboration on therapeutic tasks, and a positive emotional bond (i.e., mutual trust, liking, etc.)
between therapist and patient, the working alliance is considered by most researchers as important regardless of therapy method (although some components of the alliance may be more important in some treatments than in others). Large-scale meta-analyses have found a robust relationship between early alliance and symptom change during treatment (Flückiger, Del Re, Wampold, Symonds, & Horvath, 2012; Horvath, Del Re, Flückiger, & Symonds, 2011; Horvath & Symonds, 1991), although some areas of controversy remain, such as the direction of causality (e.g., DeRubeis, Brotman, & Gibbons, 2005; Falkenström, Granström, & Holmqvist, 2013; Zilcha-Mano, Dinger, McCarthy, & Barber, 2014) and therapist effects (Baldwin, Wampold, & Imel, 2007; Crits-Christoph, Gibbons, Hamilton, Ring-Kurtz, & Gallop, 2011; Falkenström, Granström, & Holmqvist, 2014; Zuroff, Kelly, Leybman, Blatt, & Wampold, 2010).

There is an abundance of scales for measuring the working alliance (e.g., Alexander & Luborsky, 1986; Gaston, 1991; Horvath & Greenberg, 1989; O’Malley, Suh, & Strupp, 1983). One of the most widely used is the Working Alliance Inventory (WAI; Horvath & Greenberg, 1989). The WAI was originally created from Bordin’s theory, with items selected on the basis of the three theoretically proposed alliance dimensions: agreement on goals, tasks, and emotional bond between patient and therapist. Subsequently, several studies have examined the factor structure of the WAI, some confirming a three-factor structure (Busseri & Tyler, 2003; Hatcher & Gillaspy, 2006; Horvath & Greenberg, 1989; Munder, Wilmers, Leonhart, Linster, & Barth, 2010; Tracey & Kokotovic, 1989) although usually noting that the correlation between the Task and Goal factors is very high (close to .90; see Busseri & Tyler, 2003; Horvath & Greenberg, 1989) and some preferring a two-factor structure with Task and

1 Linköping University, Linköping, Sweden
2 Uppsala University, Uppsala, Sweden
3 City University of New York, New York, NY, USA

Corresponding Author: Fredrik Falkenström, Lustigkullevägen 17, SE-616 33 Åby, Sweden. Email: [email protected]

Downloaded from asm.sagepub.com at UNIV OF WINNIPEG on October 8, 2015


Goal factors combined (Andrusyna, Tang, DeRubeis, & Luborsky, 2001; Hatcher & Barends, 1996; Webb et al., 2011). Some of these studies used the patient-rated version of the WAI, some used the therapist version, and others used the observer version. In the present article, we address the patient version, and we will therefore limit ourselves to discussions of this version from now on. Some of the previous studies on the factor structure of the WAI have used exploratory factor analysis (EFA). Since there is a clear theoretical structure for the WAI, and because there exists plenty of research on the scale, there is by now no need for further exploratory studies. Confirmatory factor analysis (CFA) imposes an a priori structure on the data, and model testing is used to confirm or refute this structure. An early CFA on the WAI by Tracey and Kokotovic (1989) found support for a bifactor model with one general alliance factor and three subfactors conforming to the theoretically proposed Goal, Task, and Bond factors. The authors developed a short form for the WAI using the 12 items that loaded most strongly on the specific factors. Noting limitations of the Tracey and Kokotovic (1989) study in the form of small sample size (N = 84 for the patient version), model fit indices not within acceptable standards, and administration of the WAI after the first session of counseling, when the alliance may not have been adequately formed, Hatcher and Gillaspy (2006) performed a new CFA on two larger samples (N = 231 and 235). The factor structures of the 36-item and the Tracey and Kokotovic (1989) 12-item versions were first tested in both samples. One-, two-, and three-factor structures were tested (including the bifactor model), and none of these showed acceptable model fit in either sample. The authors then went on to create an alternative 12-item version (WAI-SR) using EFA on the first sample.
This alternative 12-item version was subsequently tested in the second sample using CFA, showing much better model fit. A structure with three correlated factors was found best in terms of model fit, and in the revised 12-item version (WAI-SR) the correlation between Task and Goal was found to be more moderate than in previous studies (r = .61 in Sample 1 and r = .79 in Sample 2). It should be noted, however, that even though Hatcher and Gillaspy (2006) used more stringent criteria for model fit than previous studies, they still used approximate fit indices (root mean square error of approximation, comparative fit index [CFI], and Tucker–Lewis index) to evaluate model fit rather than the more stringent likelihood ratio chi-square model fit test. Indeed, even their final models failed the chi-square test at p = .001 in samples that were not extremely large. This indicates significant deviations between the model-implied and observed covariance matrices. These results were replicated by Munder et al. (2010) using the German version of the WAI-SR, with essentially the same findings (i.e., fit indices within commonly accepted norms but chi-square test significant at p < .001).

The issue of model fit evaluation has been hotly debated within the structural equations modeling community. The chi-square test of “exact model fit” has been criticized because in large samples it can be overly sensitive to small errors of misspecification (e.g., Bentler & Bonett, 1980). Also, the constraints put on models by setting all parameters not included in the model to zero seem to make most factor models fail this test (Rigdon, 2012). This has led most researchers to ignore significant chi-square tests and instead rely on approximate fit indices, which are used more as “effect size” measures of degree of model misfit. Other researchers, however, have criticized the use of approximate fit indices (e.g., Hayduk, Cummings, Boadu, Pazderka-Robinson, & Boulianne, 2007; McIntosh, 2007). First, although it is possible that a significant chi-square test is due to small, trivial misspecifications in large samples, a researcher cannot know this without conducting extensive and often difficult diagnostic examinations. Second, the rules of thumb for approximate fit indices are somewhat arbitrary. Moreover, when respecifying models using modification indices there is increased risk of capitalizing on chance. Striking a balance between these opposing views on model fit evaluation, B. O. Muthén and Asparouhov (2012) proposed Bayesian estimation using zero mean and small variance priors for constrained parameters, instead of maximum likelihood (ML) estimation, which assumes that nonestimated parameters are exactly equal to zero. The small variance priors allow for small deviations from zero in the parameters that are not of primary interest to the researcher, so that such trivial deviations do not ruin model fit. In the context of CFA, the authors present simulations and applied examples showing the use of small variance priors for cross-loadings and residual correlations.
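The correspondence between a small prior variance and the interval it implies is plain normal arithmetic; as a minimal sketch (not part of the original analyses), a zero-mean normal prior with variance .01 can be checked with SciPy:

```python
from scipy.stats import norm

# A zero-mean normal prior with variance .01 has standard deviation .1,
# so roughly 95% of its mass lies between -.20 and .20.
lo, hi = norm.interval(0.95, loc=0.0, scale=0.1)
print(round(lo, 3), round(hi, 3))  # -0.196 0.196
```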
They recommend a 95% confidence interval between −.20 and .20 for the priors, arguing that factor loadings and correlations of this magnitude can usually be considered trivial. Still, while allowing trivial deviations from zero, overall model fit is tested using Posterior Predictive Checking to check for significant deviations from the hypothesized structure. The authors argue that this constitutes a more flexible alternative for testing substantive theoretical hypotheses. An issue that is often overlooked in psychometric validation studies is measurement invariance. Measurement invariance refers to the stability of factor structure across samples and/or over time. When measurement invariance does not hold, it means that respondents from different samples interpret questionnaire items in different ways or that respondents’ interpretations of item content change over time. In this case, the typical use of sum scores to compare people from different samples or to evaluate change over time is flawed (Steinmetz, 2013). In their review of the measurement invariance literature, Vandenberg and Lance (2000) state that “violations of measurement equivalence

assumptions are as threatening to substantive interpretations as is an inability to demonstrate reliability and validity” (p. 6). For example, when testing a translation of a measure it is of interest whether the factor structure of the translation is similar to the original. This can be done by visual inspection of the factor loadings and intercepts, but a more stringent test is to use multigroup measurement invariance analysis (assuming of course that data are available for both the original and translated versions of the test). In the present article, the factor structure of the WAI-SR is tested using CFA in three independent samples. Two of these samples use a Swedish translation of the WAI-SR, while the third is one of the samples originally used to create the revised version of the WAI (Hatcher & Gillaspy, 2006). We will use regular ML estimation alongside the newer Bayesian structural equations modeling (BSEM) to analyze the factor structure of the WAI-SR. Although Bayesian and Frequentist approaches to statistics are in many ways in opposition, a pragmatic approach is taken here in which Bayesian estimation is used for testing models that are not estimable using ML due to identification issues.1 We hypothesize that the chi-square test of model fit will show significant ill-fit for models assuming nonestimated parameters to be exactly zero, but that including zero-mean, small variance priors for cross-loadings and/or residual correlations in BSEM will show that the model ill-fit can be explained by trivial cross-loadings and/or residual correlations. As a second step, we hypothesize that the essential factor structure will be the same across English and Swedish versions of the WAI-SR, as shown by multigroup measurement invariance tests.

Method

Participants

Sample 1. This sample was the second sample used by Hatcher and Gillaspy (2006) in the development of the Working Alliance Inventory–Short Form Revised (WAI-SR; see below) for cross-validating the factor structure derived from an EFA of their first sample. The sample consisted of 235 adult outpatient clients (71% women, 24% men, and 5% unidentified as to gender) from a number of counseling centers and outpatient facilities, primarily in the southwestern United States, who filled out the WAI at Session 3. The clients were treated using different psychotherapeutic approaches, although the most common were cognitive-behavioral therapy (CBT) and psychodynamic treatment. Age ranged from 18 to 64 years (M = 28.4, SD = 9.9). More details are available in Hatcher and Gillaspy (2006).

Sample 2. The second sample consisted of patients attending primary care counseling and psychotherapy of different

orientations (most were versions of CBT or psychodynamic therapy) at two service regions in Sweden. At the third session, 634 patients filled out the Swedish translation of the WAI-SR. Demographic information was available for between 75% and 80% of the patients. The mean age was 37.3 years (median = 35, SD = 14.3, range 14-88), 74% were women, and 92% were born in Sweden. More details are available in Falkenström et al. (2013, 2014) and in Holmqvist, Ström, and Foldemo (2014).

Sample 3. A third sample was composed of 234 patients from specialist psychiatric departments throughout Sweden, from an ongoing naturalistic study of routine psychotherapy delivered in psychiatric care. The patients were treated using psychotherapy of different orientations, and the WAI-SR data were taken from Session 3 in order to be comparable to the other two samples. Demographic information was unavailable at the time the present analyses were done.

Measures

Working Alliance Inventory–Short Form Revised (Hatcher & Gillaspy, 2006). The WAI-SR was developed using EFA and CFA together with item response theory modeling. Patients in Sample 1 filled out the original English language version of the WAI-SR, while Samples 2 and 3 used a Swedish translation made by Rolf Holmqvist and Tommy Skjulsvik. The translation was done using back-translation and modifications in several steps. All 12 items of the English version are shown in the appendix. Although the Hatcher and Gillaspy (2006) study, based on their item response theory analyses, recommended a 5-point response scale, in the present study the original 7-point response scale was used in all three samples.

Statistical Analyses

We used CFA, estimated using both regular ML and Bayesian estimation. For the model specifications, the fixed factor method of identification (Little, 2013) was used, in which the factor variances are constrained to 1 while all factor loadings are estimated freely.

Bayesian Structural Equations Modeling. Bayesian statistics differs from Frequentist statistics primarily in two ways: (a) parameters are not considered fixed but are seen as random variables with distributions and (b) prior information is used directly in model estimation, and the result of the analysis (called the posterior distribution) is the product of the prior information and the likelihood obtained for the data (e.g., Zyphur & Oswald, 2015). In addition, estimation of Bayesian models is usually done using simulation-based methods, called Markov Chain Monte Carlo (MCMC)


estimation. Briefly, MCMC simulates values of parameters from the posterior distribution, given the model and the data. This is done in a series of steps in which each step depends on the results of the previous one. Given a long enough chain, this procedure should converge on the most likely parameter estimates. Usually more than one chain is run, in order to enable testing whether the chains converge on similar distributions. In the present study two chains were used in all analyses. The estimates from the simulated parameters of the Markov Chain(s) constitute the posterior distribution, although the first part of each chain is usually discarded as “burn-in” to remove the influence of arbitrary starting values. The posterior distribution can be summarized in ways that may make the results of a Bayesian analysis look similar to Frequentist statistics, for example, using the mean and the 2.5 to 97.5 percentiles—a Bayesian version of the confidence interval that is usually termed the 95% credibility interval. This credibility interval can be interpreted straightforwardly as the probability that the parameter is within a certain interval, something that cannot be done with the Frequentist confidence interval because the latter is based on the idea of a large number of replications of a study and does not directly speak to the probability of a certain estimate (Kaplan & Depaoli, 2012). Perhaps the most controversial aspect of Bayesian statistics is the inclusion of prior information in statistical estimation, because of the subjectivity inherent in the choice of values for the prior. Priors can be more or less informative, and it is usually possible to choose priors that are so uninformative that their impact on the posterior distribution is minimal. The results of analyses using uninformative priors are usually very similar to ML estimates.
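Summarizing a chain of draws in this way can be sketched as follows; the "chain" here is simulated with NumPy rather than produced by a real MCMC sampler, and all names and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
# Stand-in for an MCMC chain of posterior draws for one parameter
# (a real chain would come from the sampler, e.g. Mplus or PyMC)
chain = rng.normal(loc=0.55, scale=0.08, size=20_000)

# Discard the first half of the chain as burn-in
posterior = chain[len(chain) // 2:]

point_estimate = posterior.mean()
# 95% credibility interval: the 2.5th and 97.5th percentiles of the draws
ci_low, ci_high = np.percentile(posterior, [2.5, 97.5])
print(f"{point_estimate:.2f} [{ci_low:.2f}, {ci_high:.2f}]")
```

Unlike a Frequentist confidence interval, this interval can be read directly as "the parameter lies in this range with 95% probability, given the model and the data."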
When choosing informative priors, most researchers require that these should be based on external data—preferably based on previous research. We would argue, along with, for example, B. O. Muthén and Asparouhov (2012), that subjectivity is used in regular structural equations modeling as well in the form of choice of model constraints. For example, when setting up a CFA, a model is used that specifies some paths that should be estimated freely while other paths are constrained to zero. The constraints used in regular CFA can be seen as particularly strong priors that impose exactly zero correlations for certain paths. In regular CFA the plausibility of these constraints is evaluated using model fit testing, and the same can be done for the priors in Bayesian estimation. In a “truly” Bayesian approach, this might not be seen as necessary, since the results of the analysis are supposed to be a compromise between the prior information and the data at hand, but in the more pragmatic approach taken here, model fit evaluation becomes important as a way of ensuring that the priors do not distort parameter estimates beyond the information inherent in the data.

Since BSEM is a relatively new approach, it may be necessary to put this method in relation to more established factor analytic methods. B. O. Muthén and Asparouhov (2012) place their BSEM approach in between EFA and CFA, that is, being more confirmatory than EFA but less so than CFA. BSEM is more confirmatory than EFA in that an a priori structure is postulated in BSEM, while in EFA only the number of factors is stated. Items that are assumed according to theory or previous research to load on a certain factor have freely estimated loadings, while loadings for items not assumed to load on a factor are shrunk toward the prior mean (in this case zero) to a degree specified by the prior variance. Thus, cross-loadings are not estimated freely as in EFA, because the estimates are shrunk toward the prior mean. If the specified degree of shrinkage does not match the data, this will show up as a model with poor fit. At the same time, BSEM is less confirmatory than traditional CFA in that some parameters that are not central to the theoretical model are allowed to deviate slightly from zero, which can be seen as a weakness or a strength depending on one’s opinion about exact fit testing. In addition, to estimate all possible residual correlations would not be possible in ML, since the model would not be identified (the residual correlations alone would use up all degrees of freedom for the covariance structure). As recommended by B. O. Muthén and Asparouhov (2012), we used informative priors for cross-loadings using a normally distributed prior with a mean of zero and a variance of .01 (corresponding to a 95% confidence interval between −.20 and .20). For the residuals we used the inverse Wishart (IW) distribution, which is the conjugate prior for the covariance matrix of a multivariate normal distribution. For the residual correlations, the mean of this prior was set to zero because of the assumption of approximately zero residual correlations. 
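What an inverse Wishart prior of this kind implies for the residual correlations can be inspected by sampling from it. The sketch below uses SciPy's parameterization, with the degrees of freedom set to the number of observed variables plus 30 as in the article; the identity-based scale matrix is an illustrative simplification (the article's actual scale is built from the observed variances):

```python
import numpy as np
from scipy.stats import invwishart

p = 12                             # number of observed variables (WAI-SR items)
df = p + 30                        # degrees of freedom control informativeness
scale = np.eye(p) * (df - p - 1)   # chosen so the prior mean is the identity

draws = invwishart.rvs(df=df, scale=scale, size=2_000, random_state=0)
# Residual correlation implied by each draw, for one off-diagonal element
rho = draws[:, 0, 1] / np.sqrt(draws[:, 0, 0] * draws[:, 1, 1])
print(rho.mean(), rho.std())  # centered on zero and concentrated near it
```

Raising the degrees of freedom tightens the implied distribution of the residual correlations around zero, which is how this prior encodes the "approximately zero" assumption.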
The level of informativeness for the IW distribution is controlled by the degrees of freedom. In this case, the degrees of freedom was set to the number of observed variables plus 30, which gives a prior variance of approximately .01, similar to the specification for the cross-loadings. For the specification of the prior for the diagonal elements of the IW matrix (i.e., the residual variances), we chose a mean of .2 × the mean of the observed variables’ variances, reflecting the desire that factor loadings explain most of the variance in the indicators. Sensitivity analyses with larger variance for the priors (indicating less certainty in the prior distributions) were conducted to see if estimates changed markedly by this. As recommended by B. O. Muthén and Asparouhov (2012), we used a minimum of 50,000 iterations for the primary analyses, and the final models were reestimated using 100,000 iterations. Convergence of the Markov Chains was checked using the potential scale reduction .01 as our primary indication of violation of measurement invariance. In metric models of measurement invariance analyses, the factor variances of one group were constrained to 1 while for the other groups factor variances were estimated freely. Similarly, in the scalar model, factor means were constrained to zero in one group but estimated freely in the other.

Multilevel Data Structure. Psychotherapy data are usually structured in a multilevel format, with repeated measurements nested within patients who in turn are nested within therapists (if the same therapists treat more than one patient, which is usually the case). Several studies show a significant between-therapist variation in patient-rated working alliance (Baldwin et al., 2007; Crits-Christoph et al., 2009; Falkenström et al., 2014; Zuroff et al., 2010).
Because nesting introduces within-cluster correlations (intraclass correlations [ICCs]), which violates the assumptions of traditional statistical tests such as ANOVA or multiple linear regression, psychotherapy researchers are increasingly utilizing random effects multilevel modeling in order to account for this statistical dependency. Random effects models are not often used in factor analytical studies, although there are notable exceptions (e.g., B. O. Muthén, 1991; Reise, Ventura, Nuechterlein, & Kim, 2005; Roesch et al., 2010). The

reason for this may be that standard statistical software has until recently not been able to estimate such models. Such software is now available (e.g., Mplus 7.0 and Stata 13). All analyses in this study were done using the software Mplus 7.1 (L. K. Muthén & Muthén, 1998-2012).
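The within-cluster dependency discussed above is quantified by the intraclass correlation, the share of total variance lying between clusters (here, therapists); a minimal sketch:

```python
def icc(between_var: float, within_var: float) -> float:
    """Intraclass correlation: proportion of total variance between clusters."""
    return between_var / (between_var + within_var)

# If 10% of an item's variance lies between therapists and 90% within,
# the ICC is .10
print(round(icc(0.10, 0.90), 2))  # 0.1
```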

Results

Descriptive Statistics

Table 1 shows the number of observations, means, and standard deviations for each item in the three samples. The ICCs for the nesting of patients within therapists in Sample 2 ranged from .06 to .15, showing that most of the variance in the indicators was on the within-therapist level. Thus, the amount of statistical dependency was not alarmingly high. To test the joint significance of variances at the therapist level, we estimated a two-level model with all between-level variances constrained to zero and compared this to a model in which between-level variances were estimated freely. The chi-square difference test was statistically nonsignificant (Δχ2 = 16.37, df = 12, p = .17), indicating that estimating therapist-level variances did not significantly improve model fit. Because of this, all subsequent models were run as single-level models.
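The chi-square difference test reported above can be reproduced from the reference distribution; this is a sketch assuming a standard (non-robust) difference test:

```python
from scipy.stats import chi2

delta_chi2 = 16.37   # difference in chi-square between the nested models
delta_df = 12        # difference in degrees of freedom
p_value = chi2.sf(delta_chi2, delta_df)
print(round(p_value, 2))  # about .17, matching the value reported in the text
```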

Confirmatory Factor Analyses of WAI-SR

Factor models were first run on all three samples using ML with robust standard errors,2 assuming exactly zero cross-loadings and residual correlations, in order to test our first hypothesis that the likelihood ratio chi-square test of exact fit would show significant ill-fit for such a model. As can be seen in Table 2, statistically significant chi-square tests show that exact fit tests failed for all models in all samples (as expected). When reestimating the same models (i.e., models assuming exactly zero cross-loadings and residual correlations) using Bayesian estimation, results were the same (see right-hand columns of Table 2). This is also expected, since Bayesian estimation with noninformative priors relies only on the data likelihood and should thus yield similar estimates to ML. Table 3 shows that estimating cross-loadings using zero mean and small variance priors improved model fit somewhat, but these models were still unacceptable in terms of model fit. However, estimating residual correlations using zero mean and small variance priors resulted in excellent model fit in all three samples (i.e., posterior predictive p values very close to .50 and the 95% confidence interval symmetrically centered around zero).3 The source of model ill-fit for the ML-estimated models thus seemed to be identified in the form of residual correlations between indicators. However, although zero-mean and small variance priors were used for the residual correlations,


Table 1. Descriptive Statistics.

                 Sample 1               Sample 2               Sample 3
Item          N    Mean    SD        N    Mean    SD        N    Mean    SD
Item 1      235    5.53  1.25      633    5.23  1.20      233    5.02  1.36
Item 2      235    5.47  1.31      634    5.28  1.19      232    5.13  1.32
Item 3      232    5.89  1.17      617    5.75  1.18      230    5.58  1.31
Item 4      233    5.52  1.53      631    5.73  1.22      233    5.79  1.28
Item 5      234    6.42  0.92      630    6.37  0.92      234    6.32  0.99
Item 6      233    5.86  1.27      629    5.85  1.19      227    5.89  1.24
Item 7      230    5.56  1.30      622    5.92  1.13      230    5.77  1.26
Item 8      235    5.98  1.07      632    5.92  1.11      228    5.79  1.20
Item 9      234    5.93  1.20      616    5.84  1.16      230    5.73  1.26
Item 10     234    5.83  1.19      631    5.57  1.23      233    5.54  1.31
Item 11     234    5.54  1.36      633    5.64  1.21      231    5.54  1.35
Item 12     234    5.79  1.12      634    5.81  1.18      231    5.72  1.26

Table 2. Model Fit for Maximum Likelihood and Bayesian Estimation, Assuming Exactly Zero Cross-Loadings and Residual Correlations.

                  Maximum likelihood robust            Bayesian
                  χ2       df    p    AIC    BIC       χ2 95% CI    PP p    DIC    BIC
Sample 1
  One-factor     223.9     54
  Two-factor     108.6     53
  Three-factor    98.5     51

[Remaining rows (Samples 2 and 3) and columns of Table 2 are not available in this excerpt.]
