2014 Rossi Award Lecture:* Beyond Internal Validity

Evaluation Review 2015, Vol. 39(2) 167-178 © The Author(s) 2015 Reprints and permission: DOI: 10.1177/0193841X15573659

Larry L. Orr1

Abstract

Background: For much of the last 40 years, the evaluation profession has been consumed in a battle over internal validity. Today, that battle has been decided. Random assignment, while still far from universal in practice, is almost universally acknowledged as the preferred method for impact evaluation. It is time for the profession to shift its attention to the remaining major flaws in the “standard model” of evaluation: (i) external validity and (ii) the high cost and low hit rate of experimental evaluations as currently practiced.

Recommendations: To raise the profession’s attention to external validity, the author recommends some simple, easy steps to be taken in every evaluation. The author makes two recommendations to increase the number of interventions found to be effective within existing resources: First, a two-stage evaluation strategy in which a cheap, streamlined Stage 1 evaluation is followed by a more intensive Stage 2 evaluation only for those interventions found to be effective in a Stage 1 trial and, second, use of random assignment to guide the myriad program management decisions that must be made in the course of routine program operations. This article is not intended as a solution to these issues: It is intended to stimulate the evaluation community to take these issues more seriously and to develop innovative solutions.

Keywords

methodological development, content area, outcome evaluation (other than economic evaluation), design and evaluation of programs and policies

* The Peter H. Rossi Award is given every other year to honor the lifetime achievements of Peter Rossi by recognizing important contributions to the theory or practice of program evaluation. This paper is an edited version of the acceptance remarks of the 2014 awardee, Dr. Larry Orr, delivered November 6, 2014, at the APPAM Fall Research Conference in Albuquerque, New Mexico.

1 Department of Health Policy and Management, Johns Hopkins University, Baltimore, MD, USA

Corresponding Author: Larry L. Orr, Department of Health Policy and Management, Johns Hopkins University, 4402 Leland Street, Chevy Chase, MD 20815, USA. Email: [email protected]

For much of the last 40 years, the evaluation profession has been consumed in a battle over internal validity. And, as Judy Gueron and Howard Rolston have reminded us in their aptly titled book, Fighting for Reliable Evidence, it was a battle. They speak of the “long struggle” to convince researchers and policy makers that random assignment is the method of choice to produce credible answers to important questions, like the effect of different welfare regimes on the work effort of the poor (Gueron & Rolston, 2013).

Today, that battle has been decided. Random assignment, while still far from universal in practice, is almost universally acknowledged as the preferred method for impact evaluation. There is even a term for it: “the gold standard.” So I would like to declare that war over and think about the next set of challenges we face in improving our craft.

In thinking about those challenges, I want to start with what I call the Standard Model of impact evaluation, which is again far from universally implemented but almost universally viewed as the best practice model. The Standard Model has several main features:


Random assignment of a sample of individuals or groups of individuals, like classrooms or schools, to one or more interventions (the treatment group(s)) or to the status quo or “business as usual” (the control group);
a small number of purposively selected sites;
one to two rounds of follow-up surveys for all sample members;
a process or implementation analysis; and
a benefit–cost analysis.

The Standard Model is focused on random assignment and internal validity. And from that limited perspective, it is the gold standard. But viewed more broadly, the Standard Model is anything but a gold standard.



And I know about gold: I grew up in a gold mining town. I actually worked in a gold mine. So I know my precious metals. Peter Rossi may have known his base metals, like iron and brass. But I know precious metals. And the Standard Model is not a gold standard.

The Standard Model of impact evaluation has at least two major flaws. First, as generally practiced, it has terrible external validity. And, second, it fails to take into account Peter Rossi’s Iron Law. I see these flaws as posing the next big challenges for the evaluation community. Let me speak to each in turn.

First, external validity. In working on the issue of external validity over the past several years with my colleagues Rob Olsen, Steve Bell, and Liz Stuart (see Olsen, Orr, Bell, & Stuart, 2013), I’ve discovered that the term means very different things to different people. For example, The Digest of Social Experiments (Greenberg & Shroder, 2004) contains, for each of 273 studies, a section called “Generalizability,” which most would take to mean external validity. For different experiments, this section discusses generalizability in terms of representativeness of the sample and the environment, program design, sample size, scale of program, implementation issues, attrition rate, outcome measures, and a number of other factors.

So let me state clearly what I mean by external validity. An externally valid evaluation provides unbiased estimates of the impact of an intervention on the population of interest for policy, that is, the population that would be affected if the intervention were adopted as policy (or, in the case of an ongoing program, the population currently served by that program). If, for example, the U.S. Department of Labor tests a new approach to job placement for unemployment insurance (UI) recipients, the population of policy interest is all UI recipients nationwide. If the state of Wyoming tests such a policy, the population of policy interest is all UI recipients in Wyoming. Why? Because the U.S. Department of Labor makes policy for the nation as a whole, not just Dayton or San Antonio. And the state of Wyoming makes policy for the entire state, not just Cheyenne or Newcastle.

Unfortunately, our evaluations almost never test interventions on samples that are representative of the population of interest for policy. Instead, we test interventions in sites that are convenient or cooperative, without regard to how well they represent the population of interest. Often, we do not even specify the population of interest. Of the 273 randomized trials described in the Digest of Social Experiments (Greenberg & Shroder, 2004), only 7 were designed to be representative of the population of interest.



Does it matter? Well, if interventions have the same impact everywhere, it doesn’t matter where you test them. But if impacts vary across sites, it does. And there is pretty good evidence that they do. For example, the Charter School study found school-specific impacts that varied from significantly negative to significantly positive (Gleason, Clark, Tuttle, & Dwoyer, 2010). And Howard Bloom and Christina Weiland have found equally striking, and statistically significant, variation in impacts on various outcomes in the National Head Start Impact Study (Bloom & Weiland, in press). So we have to at least allow that choice of sites may matter.

Let me be clear. We don’t know that the 266 evaluations in the Digest that did not choose representative sites yielded biased estimates of effects on the population of policy interest. But we don’t know that they didn’t, just as we never know whether nonexperimental estimates are internally biased. And just as we have become unwilling to accept estimates of unknown internal validity, we should be unwilling to accept estimates of unknown external validity.

I know what you are thinking: “So every evaluation has to have a random sample of the U.S. population? Right. Like that’s going to happen!” That’s not what I’m saying. I’m just saying that external validity is a problem that needs to be taken seriously and that the smart people in this room have to figure out how to do better. I don’t have any magic solutions. But I do know that until evaluation sponsors demand more representative results, and evaluators apply their considerable talents to producing them, the situation isn’t going to change.

And I can’t resist saying that it is possible to conduct evaluations on nationally representative samples, even for large programs. That’s what was done in the evaluations of Job Corps (Schochet, Burghardt, & McConnell, 2008), the Food Stamp Employment and Training Program (Puma & Burstein, 1994), and Head Start (Puma, Bell, Cook, & Heid, 2010).
In the Benefit Offset National Demonstration, the Social Security Administration is implementing an experimental treatment that encompasses a 20% random sample of all Social Security Disability Insurance beneficiaries nationwide (Gubits, Cook, & Bell, 2013). So it can be done.

But I’m not asking you to draw nationally representative samples because I know that, in most cases, you won’t. Here is what I am asking you to do, in every evaluation you conduct:

1. Define the population of policy interest at the outset.
2. Think about how you can select sites and draw samples that have a reasonable relationship to that population of interest.
3. Compare your sample to the population of policy interest on relevant characteristics and outcomes.
4. Document all of this in your design report.
5. Once you have results, use one of the various techniques that are available to project your estimates to the population of policy interest. Report those results along with the results for your actual sites.
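Steps 3 and 5 in the list above can be made concrete with a small sketch. The lecture does not say which projection technique to use; the example assumes post-stratification, one simple option, in which stratum-level impact estimates are reweighted from sample shares to population shares. The strata, shares, and impact values are invented for illustration and do not come from any study cited here, and the names `Stratum` and `weighted_impact` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    name: str
    sample_share: float      # share of the evaluation sample in this stratum
    population_share: float  # share of the population of policy interest
    impact: float            # estimated impact within the stratum (e.g., dollars)


def weighted_impact(strata, weight_attr):
    """Average the stratum impacts using the named share as weights."""
    total_weight = sum(getattr(s, weight_attr) for s in strata)
    return sum(getattr(s, weight_attr) * s.impact for s in strata) / total_weight


# Hypothetical numbers: the sites over-represent large urban areas.
strata = [
    Stratum("large urban", sample_share=0.70, population_share=0.40, impact=500.0),
    Stratum("small urban", sample_share=0.20, population_share=0.35, impact=200.0),
    Stratum("rural", sample_share=0.10, population_share=0.25, impact=-100.0),
]

# Step 3: compare the sample to the population on a relevant characteristic.
for s in strata:
    gap = s.sample_share - s.population_share
    print(f"{s.name}: sample {s.sample_share:.2f}, "
          f"population {s.population_share:.2f}, gap {gap:+.2f}")

# Step 5: report the estimate for the actual sites and the projected estimate.
sample_estimate = weighted_impact(strata, "sample_share")
projected_estimate = weighted_impact(strata, "population_share")
print(f"impact in the actual sites: {sample_estimate:.1f}")
print(f"impact projected to the population: {projected_estimate:.1f}")
```

With these made-up numbers the sample-weighted estimate is 380 and the projected estimate is 245, because the sites over-represent the stratum with the largest impact. Post-stratification adjusts only for the characteristics used to define the strata; differences on unmeasured characteristics remain a possible source of external bias.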

I am convinced that if every evaluation followed these simple, easy steps, the evaluation community would be much more cognizant of, and committed to achieving, external validity. And the advice we provide to policy makers would be correspondingly better and more useful.

The second major flaw in the Standard Model is that it fails to take account of Rossi’s Iron Law:

The expected value of a net impact assessment of any large scale social program is zero. (Rossi, 1987, p. 4)

My initial reaction to the Iron Law was probably about the same as yours, that is, “Gee, that’s a clever bit of hyperbole! Of course, nobody would take it literally. And it’s almost certainly wrong.” Peter himself substantially backed off a literal interpretation of the Iron Law in remarks he made at APPAM a few years ago (Rossi, 2003).

I have come to believe, however, that the Iron Law is a pretty good description of reality. Note that saying that the expected value of the impact of the interventions we test is zero is not the same as saying that they all have zero effect. It just means that the distribution of effects is centered on zero. That implies that roughly half of the interventions we test have zero or negative effect (i.e., are no better than the status quo).

That may not be far from the truth.1 For example, a review by the Coalition for Evidence-based Policy found that of the 90 interventions evaluated in randomized trials commissioned by the Institute of Education Sciences between 2002 and 2013, approximately 90% were found to have weak or no positive effects. Six of the 10 randomized evaluations of science, technology, engineering, and math programs found weak or no effects. Of the 13 interventions evaluated in Department of Labor randomized trials that have reported results since 1992, about 75% were found to have weak or no positive effects (Coalition for Evidence-Based Policy, 2013).

The situation is similar in medicine. A recent study published in the Journal of Clinical Epidemiology shows that 82% of diagnostic tests don’t improve patient outcomes (Siontis, Siontis, Contopoulos-Ioannidis, & Ioannidis, 2014).
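The distinction drawn above between an expected impact of zero and a zero impact for every intervention, together with the point in Note 1 about limited statistical power, can be illustrated with a short calculation. This is a sketch under assumptions that are not in the lecture: true impacts are taken to be normally distributed around zero, each trial's estimate adds independent normal sampling error, and the test is one-sided. The parameter values and function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def share_significant_positive(effect_sd, std_error, alpha=0.05):
    """Expected share of tested interventions with a significantly positive estimate.

    Assumes true impacts ~ N(0, effect_sd^2) and an estimate equal to the true
    impact plus N(0, std_error^2) noise, tested one-sided at level alpha.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # Estimate = true impact + noise, so it is normal with the summed variance.
    total_sd = sqrt(effect_sd ** 2 + std_error ** 2)
    # Significant when the estimate exceeds z_crit standard errors.
    return 1 - NormalDist(0.0, total_sd).cdf(z_crit * std_error)


# Assumed values: true impacts vary as much as a typical trial's standard error.
effect_sd = 1.0
std_error = 1.0

# Share of interventions whose true impact is zero or negative.
share_no_better = NormalDist(0.0, effect_sd).cdf(0.0)
share_detected = share_significant_positive(effect_sd, std_error)

print(f"share no better than the status quo: {share_no_better:.2f}")
print(f"share with a significantly positive estimate: {share_detected:.3f}")
```

When the spread of true impacts equals the standard error of a typical trial (an arbitrary choice), half of the interventions are no better than the status quo, yet only about 12% would be expected to produce a significantly positive estimate. The calculation is illustrative only; it is not a fit to the hit rates reported in the studies cited in the text.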



And much the same appears to hold in business. In his book Uncontrolled, Jim Manzi reports that of 13,000 randomized trials of new products/strategies conducted by Google and Microsoft, 80–90% have found no significant effects. Manzi also cites a Cambridge University review of 122 randomized field trials (RFTs) in criminology conducted between 1957 and 2004 in which only about 20% found statistically significant reductions in crime from the interventions tested. On the basis of this and other evidence, Manzi concludes that ‘‘ . . . the vast majority of criminal justice, social welfare, and education programs fail replicated, independent, well-designed RFTs’’ (Manzi, 2012, Chap. 12). One implication one might draw from this dismal hit rate is that we should test better interventions. That is, in fact, the implication Peter Rossi drew—in his original article, he used the Iron Law to motivate attention to designing better programs (Rossi, 1987).2 And I certainly wouldn’t want to discourage that enterprise. It appears to be the case, however, that it is almost impossible to predict with any confidence which interventions are likely to succeed. That is, after all, why we test them. Even in medicine, with its highly structured sequence of tests leading to clinical trials, treatments that appear promising in Phase II studies frequently fail large, rigorous Phase III randomized trials. For example, Zia, Siu, Pond, and Chen (2005) report a success rate of only 28% among 43 Phase III studies based on Phase II trials of identical drugs. They also cite success rates of 2–24% across all Phase III trials in several oncology specialties. If extensive Phase II testing cannot yield more effective interventions for Phase III testing in medicine, it seems unlikely that social scientists and policy analysts can do much better. These low hit rates shouldn’t come as a surprise. After all, we test new interventions against the status quo. 
It shouldn’t be surprising that about half the time an untested new intervention can’t improve on the best efforts of existing programs and institutions.

But what does all this have to do with the Standard Model of evaluation? Well, an evaluation designed according to the Standard Model typically takes 5 years to complete and costs upward of several million dollars. Most agencies’ research budgets will only support one or two of these per year. If we cannot improve the success rate by choosing better interventions, at this rate it is going to take a very long time to identify any appreciable number of effective interventions.

So what can we do? We can do better, cheaper, faster experiments. In a review of the Department of Labor’s evaluation program (Maynard, Orr, & Baron, 2013), Rebecca Maynard, Jon Baron, and I made several recommendations to take into account the low hit rate of social interventions. I believe that these recommendations are more generally applicable to other policy areas.

First, we urged the Department to choose interventions for testing by strategically searching the existing evaluation literature to identify the strongest candidates, that is, those most likely to produce sizable positive impacts. This evidence might take the form of small-scale trials or program components that appeared to be important in earlier rigorous tests of more comprehensive interventions. Raising the bar for investing in a rigorous test should improve the hit rate. But given the evidence in other fields, that is unlikely to be enough.

Our second recommendation, therefore, was to conduct experiments in a two-stage process. The first stage would be a streamlined experimental evaluation to measure the intervention’s impact on the primary outcome(s) of interest (e.g., earnings). Costs would be substantially reduced relative to the Standard Model by eliminating, or at least minimizing, the use of survey data collection, relying instead on low-cost administrative data. This first stage would thus be designed to answer the most important question for policy, that is, does the intervention produce the main hoped-for effects?

We suggested that this initial evaluation also obtain basic information on the implementation and cost of the intervention being evaluated, to help inform its replication should it be found effective. However, at this stage, we cautioned against large investments in process or implementation evaluations and data collection to support exploratory analyses for programs, policies, and practices that will typically not, in the end, be sufficiently effective to warrant adoption. Implementation data would be collected only for descriptive purposes, in a small number of site visits to interview operational staff. Cost data would be restricted to the budgetary cost of the intervention.
For interventions that demonstrate program effectiveness relative to budgetary cost in Stage 1, a second stage involving more comprehensive data collection and analyses would go forward. Stage 2 evaluations would look more like the Standard Model, with more intensive process analysis, data collection, and benefit–cost analysis. But unlike the current standard model, restricting such high-cost tests to interventions that have already demonstrated positive effects on central outcomes is almost certain to yield a higher hit rate—that is, we will identify a larger number of effective interventions, faster. The two-stage evaluation strategy also has a more subtle advantage. With a low hit rate, we run the risk of a very high ‘‘false discovery rate.’’3 Allow me to explain. At conventional levels of statistical significance, all



those ineffective interventions (the ones with a true effect of zero) have a 5–10% chance of coming up statistically significant, that is, of being “false positives” or Type 1 errors. So when one looks only at the interventions that yield statistically significant effects, a lot of them will be false positives.

Numerical example: Suppose we test 100 interventions, only 10 of which are truly effective, at the 10% significance level. We will identify 8 or 9 of the truly effective interventions as statistically significant (assuming statistical power of 80–90%). But the 90 interventions that are ineffective will produce about 9 false positives. Result: about half the statistically significant findings are false positives. This is the false discovery rate. It means that if we adopt all the interventions that yield statistically significant results in a single trial, half of them could be totally ineffective.

The surest protection against false positives is replication. In a single test, each ineffective intervention has a 5–10% chance of being a false positive. But the chance of an ineffective intervention being a false positive twice, in two successive replications, is less than 1%. The two-stage evaluation strategy automatically replicates all statistically significant results, effectively driving the false discovery rate to almost nothing.4

The point of doing better, cheaper, faster experiments is that we can do more of them with the same resources. There is one area in which we can do lots and lots of really cheap, really quick experiments. This is in the province of the “M” in APPAM. Program management involves a huge number of decisions, many of them relatively small, but collectively of central importance. In many cases, managers face a clear choice among two or more relatively well-defined options. For example, which style letter gets a better response? Should this position be staffed with a Master of Social Work or can we use a Bachelor of Arts? Do spot bonuses improve staff performance?
Which audit methods work best for detecting fraud? Many of these choices are susceptible to rigorous analysis with randomized trials. If the suggestion that management decisions like these should be made with randomized trials seems weird, that’s just because we are all used to management by gut instinct. As Jim Manzi points out in his book, major corporations like Google, Microsoft, and Capital One run thousands of experiments to decide questions like these (Manzi, 2012).

I first encountered this use of randomization back in the 90s when we were working with a direct mail firm, designing a brochure to encourage adult workers to return to school to upgrade their skills. The choice was between a positive message (“Get ahead, stay ahead”) and scare tactics (“Avoid layoffs”). The firm we were working with did a randomized test as a matter of routine, sending out 10,000 of each brochure to randomly selected addresses and adopting the one that got the higher response rate.

So what I am suggesting is not at all new; it’s just not done in government.5 It should be. Experiments like these are very cheap to carry out and their results are available in weeks or months, not years. If we believe in evidence-based decision making, this is an area ripe for exploitation.

I raise these issues not because I think that doing any of the things I suggest will be easy. On the contrary, it will take imagination and persistence to move the profession out of the comfortable rut it has settled into. But the payoff, in terms of better advice to policy makers and better program management, could be enormous. I urge the profession to take on these challenges.

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Notes

1. In fact, as we will see, the proportion of interventions tested that fail to show statistically significant impact appears to be substantially greater than 50%. But that proportion undoubtedly includes a number whose impacts are positive but too small to be detected given the power of the test.
2. Epstein and Klerman (2012) also propose an approach to reducing the number of tests of programs that are likely to be ineffective.
3. See Benjamini and Hochberg (1995) for one of the original articles on the false discovery rate. More recently, Schochet (2009) has provided a relatively comprehensive discussion of the issues involved in the “multiple testing” problem, as well as a review of potential solutions.
4. The notion of replication is, of course, not new. It is, in fact, one of the basic tools of science, in both the social and the physical realms. And while the notion of replication has been honored more in the breach than in practice, it has been applied to a number of evaluations of social programs. See, for example, Miller, Bos, Porter, Tseng, and Abe (2005), Meyer (1995), and Maxfield, Schirm, and Rodriguez-Planas (2003). Barnow and Greenberg (2013) discuss these and other replication efforts.
5. Nor am I the only one advocating this approach for social programs; see Cody and Asher (2014), who term this approach “rapid cycle evaluation,” and Besharov (2009), who made similar arguments in his Presidential address to the Association for Public Policy Analysis and Management.
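The false discovery arithmetic behind Note 3 and the numerical example in the text (100 interventions, 10 of them truly effective, tested at the 10% significance level) can be sketched as follows. Two assumptions are added here and are not stated in the lecture: statistical power is set to 0.85, a reading of "8 or 9" of the 10 effective interventions being detected, and the Stage 1 and Stage 2 trials are treated as independent tests with the same significance level and power. The function names are hypothetical.

```python
def false_discovery_rate(n_tested, n_effective, alpha, power, n_rounds=1):
    """Expected share of 'significant' interventions that are truly ineffective.

    An intervention counts as a discovery only if it is statistically
    significant in every one of n_rounds independent trials (assumed).
    """
    n_ineffective = n_tested - n_effective
    true_positives = n_effective * power ** n_rounds
    false_positives = n_ineffective * alpha ** n_rounds
    return false_positives / (true_positives + false_positives)


# Values from the numerical example in the text, plus the assumed power.
N_TESTED, N_EFFECTIVE, POWER = 100, 10, 0.85

for alpha in (0.10, 0.05):
    single = false_discovery_rate(N_TESTED, N_EFFECTIVE, alpha, POWER, n_rounds=1)
    replicated = false_discovery_rate(N_TESTED, N_EFFECTIVE, alpha, POWER, n_rounds=2)
    print(f"alpha={alpha:.2f}: single trial FDR={single:.3f}, "
          f"two-stage FDR={replicated:.3f}")
```

Under these assumptions, a single round of testing at the 10% level gives a false discovery rate of about one half, as in the text, and requiring a second significant result lowers it to roughly 11%; at the 5% level the corresponding figures are about 35% and 3%. A stricter significance threshold or higher power in the second stage would lower the rate further.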

References

Barnow, B. S., & Greenberg, D. (2013, July). Replication issues in social experiments. Journal for Labour Market Research, 46, 239–252.

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57, 289–300.

Besharov, D. J. (2009). Presidential address: From the Great Society to continuous improvement government: Shifting from does it work? To what would make it better? Journal of Policy Analysis and Management, 28, 199–220.

Bloom, H. S., & Weiland, C. (in press). Quantifying variation in Head Start effects on young children’s cognitive and socio-emotional skills using data from the National Head Start Impact Study.

Coalition for Evidence-based Policy. (2013). Randomized controlled trials commissioned by the Institute of Education Sciences since 2002: How many found positive versus weak or no effects. Retrieved from http://coalition4evidence.org/468-2/publications/

Cody, S., & Asher, A. (2014, June 19). Proposal 14: Smarter, better, faster: The potential for predictive analytics and rapid-cycle evaluation to improve program development and outcomes. In M. S. Kearney & B. H. Harris (Eds.), Policies to Address Poverty in America (pp. 147–156). Washington, DC: The Brookings Institution. Retrieved from and_links/policies_address_poverty_in_america_full_book.pdf

Epstein, D., & Klerman, J. A. (2012). When is a program ready for rigorous impact evaluation? The role of a falsifiable logic model. Evaluation Review, 36, 375.

Gleason, P., Clark, M., Tuttle, C. C., & Dwoyer, E. (2010). The evaluation of charter school impacts (No. 6676). Washington, DC: Mathematica Policy Research.

Greenberg, D., & Shroder, M. (2004). The digest of social experiments (3rd ed.). Washington, DC: The Urban Institute Press.

Gubits, D., Cook, R., & Bell, S. (2013). BOND implementation and evaluation: Stage 2 early assessment report. Cambridge, MA: Abt Associates and Princeton, NJ: Mathematica Policy Research.

Gueron, J. M., & Rolston, H. (2013). Fighting for reliable evidence. New York, NY: Russell Sage Foundation.

Manzi, J. (2012). Uncontrolled: The surprising payoff of trial-and-error for business, politics, and society (pp. 128–142). New York, NY: Perseus Books Group.

Maxfield, M., Schirm, A., & Rodriguez-Planas, N. (2003). The Quantum Opportunity Program demonstration: Implementation and short-term impacts. Washington, DC: Mathematica Policy Research.

Maynard, R., Orr, L. L., & Baron, J. (2013). Increasing the success of evaluation studies in building a body of effective, evidence-based programs: Recommendations of a peer-review panel. Retrieved from

Meyer, B. D. (1995). Lessons from the U.S. unemployment insurance experiments. Journal of Economic Literature, 33, 91–131.

Miller, C., Bos, J., Porter, K., Tseng, F., & Abe, Y. (2005). The challenge of repeating success in a changing world: Final report on the Center for Employment Training replication sites. New York, NY: MDRC.

Olsen, R. B., Orr, L. L., Bell, S. H., & Stuart, E. A. (2013, Winter). External validity in policy evaluations that choose sites purposively. Journal of Policy Analysis and Management, 32, 107–121.

Puma, M., Bell, S., Cook, R., & Heid, C. (2010). Head Start impact study. Final report. Washington, DC: Administration for Children and Families, U.S. Department of Health and Human Services. Retrieved from http://www.acf.

Puma, M. J., & Burstein, N. R. (1994, Spring). The national evaluation of the Food Stamp Employment and Training Program. Journal of Policy Analysis and Management, 13, 311–330.

Rossi, P. H. (1987). The iron law of evaluation and other metallic rules. Research in Social Problems and Public Policy, 4, 3–20.

Rossi, P. H. (2003, October). The iron law of evaluation reconsidered. Remarks presented at the 2003 APPAM research conference, Washington, DC.

Schochet, P. Z. (2009, December). An approach for addressing the multiple testing problem in social policy impact evaluations. Evaluation Review, 33, 539–567.

Schochet, P. Z., Burghardt, J., & McConnell, S. (2008). Does Job Corps work? Impact findings from the National Job Corps Study. The American Economic Review, 98, 1864–1886.

Siontis, K. C., Siontis, G. C. M., Contopoulos-Ioannidis, D. G., & Ioannidis, J. P. A. (2014, June). Diagnostic tests often fail to lead to changes in patient outcomes. Journal of Clinical Epidemiology, 67, 612–621.

Zia, M. I., Siu, L. L., Pond, G. R., & Chen, E. X. (2005, October). Comparison of outcomes of phase II studies and subsequent randomized control studies using identical chemotherapeutic regimens. Journal of Clinical Oncology, 23, 6982–6991.

Author Biography

Larry L. Orr teaches program evaluation in the Department of Health Policy and Management, Johns Hopkins University, and consults on the design and implementation of large-scale evaluations. He directed research and evaluation offices in the U.S. Departments of Labor and Health, Education, and Welfare, and served as Chief Economist at Abt Associates. He has authored numerous books and articles, including the text, Social Experiments: Evaluating Public Programs with Experimental Methods.

