Journal of Consulting and Clinical Psychology 1978, Vol. 46, No. 4, 792-805

Some Problems in Community Program Evaluation Research

Emory L. Cowen
University of Rochester

Realities of the community context militate against good program evaluation research. Many limiting factors in such research stem from a clash in values between those who must deliver and those who must evaluate community services. Detailed consideration is given to several clusters of difficulties that plague community program evaluation studies, including (a) sources of data bias, (b) issues of design, (c) problems in the choice and use of criteria, and (d) problems of experimental control. Although community program evaluation studies can surely be improved, it is unlikely that the purity of antiseptic, laboratory research will ever be attained. Ultimate conclusions about the effectiveness of community service programs may thus have to come about slowly and cumulatively, based on convergent findings from many individual less-than-ideal outcome studies.

The author is well qualified to write about research errors in community mental health because he has, personally, committed virtually every one of them. Had the editor told authors to start their articles with a pithy bit of folk wisdom that captured their essence, mine would have been: "Do as I say, not as I do!" This article is intended as a straightforward consideration of several hazards that plague community program evaluation research. It is not designed to lecture or to pontificate. If the words come through as "holier than thou," it will reflect a serious communication failure. Starting this article as I did was dictated by other than modesty—or even masochism. It was to suggest that some problems of community research are so intrinsic to the nature of the beast that they are very difficult to surmount. As stated elsewhere (Cowen, Lorion, & Dorr, 1974), the choice for investigators in this field is often between doing far less than ideal research or no research at all.

My thanks to two anonymous, but gracious and constructive, referees who reviewed the initial draft. Their wisdom and perceptiveness are reflected in the current incarnation. Requests for reprints should be sent to Emory L. Cowen, Department of Psychology, University of Rochester, Rochester, New York 14627.

With that as preamble, a first practical task
is to sharpen the article's focus—somewhat easier said than done. Terms such as community research or research in community psychology or mental health are broad and amorphous (Cowen, 1973). In an earlier article, I suggested that although clinical psychology, community mental health, and community psychology share common prime concerns with people's adjustment, adaptation, security, happiness, self-concept, that is, their well-being, the nature and timing of their defining practices differ radically (Cowen, 1977). Thus, clinical psychology—as well as psychiatry and social work—has traditionally used repair strategies such as psychotherapy, addressed to already evident, crystallized problems. Community mental health's (CMH) roots lie in keenly felt dissatisfactions with the effectiveness of classic mental health repair systems. CMH does not abandon the casualty-repair orientation; rather, it directs its efforts toward the perceived insufficiencies of past traditional approaches. The thrusts of the CMH movement are to identify problems earlier, in more natural settings (e.g., schools), and to use more flexible, hopefully more realistic, approaches, sometimes carried out by nontraditional help agents. The real importance of community in CMH is that it harbors settings and contexts that make it easier to do
those things. Community psychology, sharing much the same goals as traditional or CMH approaches, departs sharply from them in its strategies. It is mass oriented rather than individual oriented, and it seeks to build health from the start rather than to repair. The preceding account is grossly oversimplified. It extracts distilled essences of pure models that are blurred and muddied, in nature, at many overlapping points. Moreover, the approaches are depicted unidimensionally when, in fact, each consists of an agglomerate of strategies, techniques and, ultimately, research terrains. Bloom (1977) documented that complexity for CMH, identifying 10 important ways in which such approaches differ from past traditional mental health practices. Of special interest to this discussion is Bloom's opening definition of CMH as "all activities undertaken in the community in the name of mental health" (1977, p. 49). In that vein, he cites as a first feature, which distinguishes CMH from traditional clinical activities, the fact that the former are based on practice in the community. Several of Bloom's later key discriminanda (e.g., emphases on early service delivery, indirect services, use of nontraditional manpower) are natural derivatives of the community locus of CMH programs. Important as the preceding structural emphases are, they do not yet begin to identify CMH's substantive complexities. The latter can be illustrated simply by noting several of the field's active current areas of programming and research: needs assessment surveys; alcoholism and drug-abuse programs; mental health consultation by, and for, diverse groups; varied types of crisis intervention programs; selection, training, and performance of nontraditional help agents; informal help-giving processes, natural caregivers, and community support networks; early detection and intervention; and alternative service delivery modes for inner-city and rural folk. A brief account such as this could not encompass the myriad of research problems of all the foregoing areas even if the author had (which he does not) the expertise to do so. That limitation, however, is more merciful than tragic. Though CMH evaluation research has clear defining qualities, it is not, thank goodness, a world unto itself with a totally
unique technology, methodology, and modus operandi. Its basic problems are related to those of other major areas of outcome research (e.g., evaluation of psychotherapy or educational programs). The full array of research problems includes those of designing, conducting, and interpreting studies. The greatest commonality in such problems between outcome research in CMH and other areas—and thus the topic least pursued in this article—is error in interpreting findings. Thus, overinterpreting chance findings (e.g., the "meaning" of five Fs out of 100 significant at p < .05); confusing statistical significance with meaningfulness in interpreting correlation coefficients (e.g., r = .07, significant at p < .05, because N = 1,000); capitalizing on chance to develop a regression equation that is not cross-validated; or overreacting to a significant chi-square based on small expected cell frequencies, unmodified by Yates's correction, are common errors of data interpretation in many substantive areas—not just CMH.
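As a brief aside not in the original article, the first two of these interpretive traps can be made concrete with a few lines of illustrative Python; the figures (N = 1,000, r = .07, 100 tests at p < .05) simply mirror the hypothetical numbers in the text, and the use of NumPy and SciPy is an assumption of the sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Trap 1: statistical significance without meaningfulness.
# With N = 1,000 (hypothetical), a correlation as small as r = .07 reaches p < .05,
# yet it accounts for less than half of one percent of the variance.
n, r = 1000, 0.07
t = r * np.sqrt((n - 2) / (1 - r ** 2))
p = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"r = {r}, N = {n}: p = {p:.3f}, variance explained = {r ** 2:.2%}")

# Trap 2: overinterpreting chance findings.
# Run 100 tests on pure noise; roughly five will be "significant" at p < .05 by chance.
hits = sum(
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue < 0.05
    for _ in range(100)
)
print(f"'Significant' results among 100 tests of a true null hypothesis: {hits}")
```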

A key thesis of this article is that certain errors in designing and conducting CMH program evaluation studies flow naturally from the special hazards of doing research in the community. The latter include (a) the low priority that program evaluation research may have in an agency's hierarchy of values; (b) the fact that the researcher may be—or be seen as—a foreign body in the system; (c) the threat that evaluation research poses for a program's funding or personnel; (d) difficulties in gaining entry into community systems; (e) the complex research demands that longitudinal programs impose; (f) the vulnerability of community programs to change after they start; (g) the involvements of community review bodies; and (h) growing concerns about human rights and the invasion of privacy. Specific research problems considered later in this article can stem, directly or indirectly, from such qualities of community contexts. Accordingly, many gut problems of community research occur less because investigators do not "know any better" and more because reality keeps them from doing any better.

One key "common denominator" obstacle to sound community evaluation studies is that the researcher's and agency's goals frequently work at cross-purposes. Schools are for teaching; courts are for meting out justice; community mental health center (CMHC) clinics are for working with patients; and hospital wards are for caring for the sick. Community agencies share the mandate of bringing needed services to people. Their first concern must be with the extent and quality of those services. Their prime goal is to insure delivery of optimal services. The evaluator's first allegiance, by contrast, is to sound design, methodology, and instrumentation, which are so important to the defensibility of a study's conclusions. These perspectives can, and do, clash. The real rub comes when program personnel see the design of a prospective study as encroaching on, or restricting, services, and the researcher sees service pressures as a factor that will corrupt the study's integrity. To complicate the matter further, some service personnel see program evaluations as personal evaluations and, therefore, as threatening. Others, not sympathetic to research in the first place, or who feel harassed by heavy job pressures, resent the extra burdens that an evaluation study places on them. Whether or not a just God in heaven might judge such perceptions to be warranted is not the critical issue. Because they are real to beholders, they result in very real behaviors such as lack of cooperation, anger, passive aggression, and delayed and/or careless completion of forms in ways that limit their validity and interpretability.

There are few guaranteed solutions for such problems. Involving program personnel in planning studies, taking time to explain the purposes and significance of the research to them, using maximally parsimonious measures that, insofar as possible, are relevant to the respondents' job turf, and feeding back findings in usable forms are steps that can strengthen program evaluation studies. But even so, the differing needs and values of program workers and evaluators remain as background noise that often works against sound research. That less-than-idyllic backdrop frames consideration of specific problems in CMH program evaluation research in the sections to follow. Although not all of these problems fit neatly into pigeonholes, broad categories such as data and design bias, choice of criteria, and
the misuse of controls clearly are chronic offenders.

Data Bias

In the past several decades, psychological research in general has become more sensitive to, and sophisticated about, sources of data bias. Observer judgments are highly subject to stylistic inputs such as social desirability responding, "yeasaying" or "naysaying," excessive use or avoidance of extreme ratings, and halo effects, any of which can be confounded with substantive variables that an instrument purports to assess. Although ways of instrument construction and usage have been developed that minimize stylistic confounds (or, at least, identify them after the fact), the corrosive effects of those variables still mar community program evaluation studies. This is so because such studies often depend heavily on the judgments of service recipients and providers as prime data sources. Few would disagree with the assertion that a client's view of how he/she has done in a program is one relevant way to evaluate the program's effectiveness. But there are many reasons why it is misleading to use such data as the only way of evaluating a program (Bloom, 1972). For brevity's sake, the issue can be put in concrete, caricatured form. When a program ends, the male client is asked, in 20 different guises: "So, how'd you do?" His response, also in 20 ways, is "Terrific!" Problem No. 1: Did he respond that way because that's how he feels or because he senses that that's what the experimenter wants to hear? Problem No. 2: If he truly does feel better, is it due to the program or because he just struck oil or won the Irish Sweepstakes? Problem No. 3: If indeed he does feel better because of the program, has his behavior changed in a parallel way?

Service providers are also relevant—indeed, important—data sources in program evaluation, because they are uniquely familiar with clients and their everyday behaviors. But they too are sources of potentially serious data bias. They can bias a study, obviously, by not completing essential forms, by completing them carelessly, or by submitting them too late. More subtly, their responses to questions
about client behavior and how it has changed can be shaped by their stake in, and cathexes to, a program, that is, whether they believe that they are really being asked to evaluate the client's behavior or their own effectiveness; whether they see the program as theirs or others' and, if the latter, whether they like or dislike those people. The dilemma is clear. On the one hand, people who staff a program and/or receive its services have observations and information that are highly pertinent to its evaluation. But, for whatever reasons—many extraneous to the program's content and thrust—they do not always respond in those terms. The interpretability and ultimate contribution of rater judgments may be increased by broadening the observational report bases. There is more danger of bias, pro or con, with only one versus several judges or one versus several perspectives (e.g., self, therapist, job, home). If observers with different stakes and perspectives agree about change, it is more likely that real change has taken place. Even more useful, if feasible, is including behavioral anchor points in the overall evaluation net. Correspondences between bona fide behavior change and the judgments of human observers increase one's confidence in the latter as a converging source of evidence in evaluating program effects.

Design Problems

Campbell (1969) has written a sophisticated treatise on program evaluation designs for community research—a topic somewhat beyond the scope of this article. The present discussion focuses on two specific design issues that have created special problems for community research—follow-up and systematic versus representative design.

The purpose of follow-up is to insure that effects observed when a program ends accurately and stably mirror the program's impact. Follow-up data thus solidify generalizations about program effects over time. Such information is important for planning future programs. Immediate postprogram findings can be misleading in several respects. Thus, what seems to be improvement can dissipate over time, because it was not (a) real in the
first place, (b) solid enough to permit the individual to meet life's demands after the program ended, or (c) supported by the postprogram environmental context. Without follow-up, we can also underestimate program effects. Significant experimental-control differences may not be found when a program ends, but they may show up 6 months or a year later. Illustratively, a preventive program is developed for children experiencing current life crises who have not yet shown major signs of maladjustment. At the end of the program, experimental subjects look much the same as comparable crisis (nonprogram) control subjects. When evaluated a year later, however, experimentals are found to have maintained sound adjustment, but nonprogram controls have deteriorated behaviorally and educationally. In such a situation one might infer that the program had important inoculative value but that it was too soon for that effect to be detected when the program ended. Without follow-up the community program evaluator is vulnerable to incorrect conclusions and generalizations. He/she may thus either perpetuate a shaky program or dismiss an effective one prematurely.

Brunswik (1947) wrote an informative essay on systematic and representative design in psychological experiments, later applied to research in clinical psychology (Hammond, 1954). Brunswik's main argument was that in order to generalize beyond a study's specific circumstances, one needs not only an adequate subject N, which most experiments have, but also experimental conditions that represent, statistically, the universe of circumstances to which the experimenter hopes to generalize. A simple example would be a comparative study of the effects of male versus female examiners on operant conditioning rates of male and female subjects. Assume that such a study was done with one male and one female research assistant, each of whom ran 50 male and 50 female subjects, for a total of 200 subjects. Although, on the surface, that seems a "reasonable" N, the question is, reasonable for what? How useful is it to know that two particular experimenters—one of whom happened to be male and the other female—got different conditioning results either overall or
differentially by subjects' sex? Does such a finding mean that being a male or a female experimenter was the critical variable underlying the observed performance differences rather than, let's say, differences in their warmth, verbal styles, cues emitted, or degree of comfort with subjects? Most studies seek to generalize beyond their own literal conditions. For the hypothetical study cited, to generalize about the effects of male and female examiners would require representative sampling along the dimension of experimenter sex. Thus, in effect, the N for the study was not 200; it was 1 in each experimenter's group.

Community program evaluation studies are especially vulnerable to problems of systematic versus representative design. Such research often seeks to generalize about large community units (e.g., schools, CMHCs). The researcher, however, may have access to one or, at most, a very limited number of such units. Thus a study is conducted to learn how low-income subjects use mental health services within a CMHC and how effective those services are. The study, done at a large urban CMHC, involves a consecutive sample of 500 low-income subjects who sought services, for whatever reasons, during the past X number of months. The study's main findings are that 54% of the subjects did not return after the initial visit and that short-term goal-oriented therapy was found (by whatever criteria) to be the most effective of four treatment conditions studied. Although such information may be very helpful in pinpointing the practices and strengths of the particular setting, a frequent error is to generalize the findings to CMHCs, or CMHCs in large urban settings. The problem in so doing is that the center in question has its own special defining qualities and practices (e.g., inviting/uninviting physical layout; poor/good reception and/or initial interviewing practices; and committed, dedicated short-term therapists), any of which can produce the specific findings obtained. To generalize about large CMHCs requires representative sampling on that dimension. The problem comes up in many guises. An investigator might wish to compare the effectiveness of clinical case versus process consultation. Such a study is well designed in terms of the number and types of groups (e.g.,
teachers, lawyers, clergymen) with whom the approaches were used and the numbers of contacts with each group. However, if there is only one consultant per approach, generalization with respect to the study's main question is drastically restricted. Because there are so many ways (e.g., experience, comfort, and confidence with an approach; personal warmth; verbal facility) in which the consultants could have differed, besides the ostensible variable under study (i.e., type of consulting approach), conclusions about the relative effectiveness of the approaches could not be made without representative sampling on the consultant dimension.

Another example would be a comparative study of the attitudes and job satisfactions of mental health professionals and paraprofessionals. One convenient (often the most convenient) way of doing such a study is to recruit fairly sizable Ns for both groups from a single, large facility that employs, let's say, 20 professionals and 40 paraprofessionals. Assume that the study is done that way and that clear group differences in job satisfaction are found. The investigator concludes that there are basic cross-group (mental health professionals vs. paraprofessionals) differences in job satisfaction. But, again, because the study was done in a single center, the findings are more likely to reflect that setting's particulars (e.g., hours or conditions of employment, salary levels, promotion policies, job security, how positions at various levels are perceived and valued) than generalized cross-group differences on the variable in question.

Generalization of research findings depends on representativeness of design on all pertinent dimensions. This cannot ordinarily be achieved in a one-variable systematic design. If a community program evaluation study seeks to reach conclusions that transcend a particular setting, it must adequately sample the situations and variables that are central to its generalization focus, as well as the usual adequate sampling of subjects.
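The effective-N problem behind these examples can be illustrated with a small simulation that is not drawn from the article itself; the layout (one experimenter of each sex running 50 subjects apiece) echoes the earlier hypothetical study, and all numerical values, as well as the use of NumPy and SciPy, are assumptions of the sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical world mirroring the example in the text: experimenter sex has NO
# true effect, but each individual experimenter adds an idiosyncratic constant
# (warmth, verbal style, cues emitted), drawn here with standard deviation 1.0.
def one_study():
    male_e = rng.normal(0, 1.0)      # idiosyncrasy of the single male experimenter
    female_e = rng.normal(0, 1.0)    # idiosyncrasy of the single female experimenter
    male_run = rng.normal(male_e, 1.0, size=50)      # 50 subjects run by him
    female_run = rng.normal(female_e, 1.0, size=50)  # 50 subjects run by her
    return stats.ttest_ind(male_run, female_run).pvalue < 0.05

false_alarms = np.mean([one_study() for _ in range(2000)])
print(f"Studies finding a 'significant' sex-of-experimenter effect: {false_alarms:.0%}")

# The subject-level test treats N as 100, but for generalizing about experimenter
# sex the effective N is 1 experimenter per group, so spurious "sex" effects
# appear far more often than the nominal 5%.
```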

Criterion Problems

The present discussion assumes that classic psychometric problems (e.g., reliability and validity) are well understood, at least intellectually, and that researchers as well as editors do what they can to keep the faith with respect to them. Hence, this section focuses more particularly on criterion problems that are predisposed by the special nature and pressures of community research. Two such groups of problems are considered: (a) the extent to which criterion measures are appropriate to a study's purposes and (b) pressures on the researcher to use less than optimal criterion measures.

Are the criteria appropriate to the study's purposes? In considering how CMH approaches differ from traditional ones, one frequently noted characteristic is that they involve indirect rather than direct services (Bloom, 1977). Thus, we do consultation with public health nurses, pediatricians, teachers, and clergymen because these groups have extensive everyday contacts with distressed individuals. The rationale for consultation is that upgrading consultees' knowledge and skills helps them to be more effective in dealing with the personal problems that their clientele often bring to them. Psychologically oriented educational programs for those who are about to become parents, or for the parents of newborn or young children, have a similar rationale. Hopefully, strengthening the knowledge and/or attitude bases of program participants can lead to more facilitating, health-producing child-rearing attitudes and practices. Thus, consultation and mental health education epitomize the structural pattern of indirect service; that is, they both are directed toward groups that have systematic contacts with, and are hence in a position to help, others. But the programs' ultimate concerns are with the beneficial effects that intermediaries have on target people. Mental health consultation and education seek to enhance the knowledge, feelings, and attitudes of the groups that they touch. It is often assumed that if those variables change positively, constructive change in the behavior of people with whom the intermediary interacts will ensue. Because indirect service programs have direct contacts only with intermediaries, and because first-line changes are easier to get at than once-removed ones, measures of change in consultees or parents are
often used as the only basis for evaluating their effectiveness. Even if criterion measures are well selected at that level and convincing evidence of change is found (i.e., consultees or parents are shown to have enjoyed a program, to have learned a lot from it, and to have developed more favorable mental health attitudes), it cannot be assumed that those changes lead to more effective helping, or growth-supporting, practices. Kelly (1971) addresses this issue:

The payoff from a consultation program is not only an alteration of the feeling states, belief systems, and aspirations of the consultee, but should also reflect a change in a person's relationships with those significant others who directly participate in his life setting. Therefore evaluation studies should not measure change in attitudes of consultees, nor analyze samples of the interactions between consultant and consultee, nor note changes in the consultee's self-concept, for such attempts at evaluation are not congruent with a conception of consultation as a preventive intervention. . . . If . . . consultation is effective in initiating a change process, then indices of effectiveness should be defined not only by changes in consultee performance, such as the classroom teacher, but by cumulative and successive changes in the behavior of significant others, for example, students in the classroom. . . . When considering research designs to document the effects of consultation . . . it is essential to provide for assessment of the radiating effects of the intervention. . . . An intervention such as consultation can be preventive only if the consultee produces change in significant others. (pp. 114-115)

The point to emphasize is that although there is nothing wrong per se with using way-station criteria such as changes in the knowledge and attitudes of consultees or parents in indirect service programs, such criteria alone are insufficient. So-called instrumental changes, if found, do not guarantee that positive behavior changes will follow in the ultimate target group. Without assessing the latter directly, there is the danger that all concerned parties will have had a pleasant, seemingly productive experience that fails to help the program's ultimate targets. Although the preceding is a widespread problem, it is not universal. Behavioral consultation (Heller & Monahan, 1977), for example, is often evaluated only in terms of specific behavior changes observed in ultimate targets. Moreover, there are examples of (nonbehavioral) indirect service programs in
parent education (e.g., Hereford, 1963; Glidewell, Gildea, & Kaufman, 1973), in which positive program effects (behavioral and adjustive) were indeed shown for the intended ultimate targets—children. Research designs that include measures of changes both in direct recipients and ultimate targets offer an added dimension of richness in that linkages between the two levels of change can be explored.

Pressures affecting the choice of criterion measures. Community background contextual factors, such as those discussed earlier, that interfere in general with program evaluation hit especially hard when it comes to selecting and using research criteria. If the basic climate is not conducive to evaluation, resistances to assessment procedures can, and do, develop. Such procedures can all too readily be seen as time-consuming, disruptive, and personally intrusive. Time-consuming is often defined phenomenologically rather than objectively. The author has had the experience of finding respondents more receptive to a 1-page format that required 5 minutes to complete than to a similar 10-page format that required only 3 minutes to complete. That aside, the key practical concern is that if key responders see a measure as too time-consuming, for whatever reason, it can effectively rule out that measure as a criterion. Take the following example: The main objective of a day-care program is to "resocialize" patients along dimensions such as (a) initiative, (b) self-help behaviors, (c) interaction with peers, (d) interaction with program personnel, (e) outside recreational activities, and (f) outside job activities. The researcher thus believes that judgments by knowledgeable program personnel about changes in specific behaviors, in each of the above subareas, would be among the criteria of choice for the study. Although he/she develops an appropriate instrument to assess 10 specific behaviors per category that requires only 20 minutes per subject to complete, respondents decide they cannot, or will not, give that amount of time to the task. A frequent "compromise" solution in such circumstances is to use "quick and dirty" but vaguer and more abstract global ratings such as "interactions with peers." Although these may be useful, they also lose the richness of
the phenomena under study, pull for generalized attitudinal responses to the program and its personnel, and narrow the base for evaluating specific program effects or identifying particular program strengths and weaknesses as a guide to its modification and improvement.

The reality bugaboo of the potential disruptiveness of evaluation procedures is another source of noise that materially restricts the researcher's choice of outcome criteria. One might, for example, envision several other data-gathering strategies for the hypothetical study just cited, such as (a) direct behavioral observations of subjects to assess variables such as peer interaction, autonomy, initiative, and reactions to program personnel; and (b) tests that provide data from which inferences about such variables could be made. Both approaches present hazards. Apart from the intrinsic complexities involved in developing reliable, valid frameworks for observing and recording behaviors, such procedures are often seen as intrusive and are therefore resisted. "Outsiders" must be introduced to the doings of an ongoing program, a process that program personnel or participants may see as disruptive, threatening, uncomfortable, or just plain not wanted. Similarly, removing subjects from ongoing program activities (e.g., schoolwork) for evaluations, especially time-consuming ones, also elicits resistance—if not to stop the procedure entirely, then to pare it to the bone.

Sensitivity about potential invasion of privacy is another factor that restricts the use of certain criteria in evaluating community programs. A given program may seek to improve people's sexual adjustment; another may be aimed at solidifying disrupted parent-child relationships. But to probe directly in these sensitive (albeit face-valid) areas may be so threatening that the use of theoretically ideal criteria is blocked before the fact.

Zax and Klein (1960) argued that behavioral criteria are often among the best to use in evaluating the effectiveness of mental health interventions. Because behavior, and its disruption, often defines and is at the nerve center of an individual's problem, and because it tends to be objective and palpable, using behavioral indices of change is both face valid and commonsensical. Unfortunately, the
inaccessibility of such data, plus the fact that it may be costly or time-consuming to obtain, has caused it to be underused in evaluating the effectiveness of community programs.

The preceding constraints on the use of criteria put dire pressures on the community program evaluator to compromise. Compromise sometimes means using indirect measures; instruments that are out of phase with the program's goals; measures of unknown or dubious reliability and validity; and vague, global criteria that are difficult to relate to the study's focal variables. Though such problems are far from unique to community program evaluations, they are pronounced in that field. This is an especially tough blow, since variables of prime concern in CMH program evaluation research (e.g., health, pathology, adjustment, coping) are difficult enough to measure even when we have a "clean shot" at them (Bloom, 1977). Stated another way, program evaluation research is limited by the "state of the art" in assessment. Illustratively, the researcher may be asked to evaluate a program designed to strengthen the self-concepts of preschool children. If, however, a psychometrically sound measure of self-concept for children of that age is not available, the functional choice is between using an unsatisfactory measure of self-concept or a psychometrically more sound instrument that comes as close as possible to assessing that variable.

Special criterion problems come up in outcome studies that cut across groups or settings. Thus, in a study designed to compare the effects of a specific intervention on the attitudes, intellectual performance, and/or adjustment of middle- and low-income groups, it is important that the criterion measures be equally appropriate for the two groups being compared. If not, extraneous factors such as lack of item clarity or inappropriateness of test items for either group might easily be confounded with differential outcomes for the method being evaluated. Much the same problem can occur in evaluation studies that cut across structurally comparable settings, which, however, use different assessment procedures. Cowen et al. (1974) have addressed this question in the context of a study designed to evaluate the
effectiveness of a multidistrict school mental health program. Because the program was school based, an estimate of the child's current school performance seemed to be one reasonable criterion for evaluating its effectiveness. But going across school districts made it virtually impossible to obtain such a measure. Formerly, the classic "A, B, C, D, E" report card was an answer to the researcher's prayers precisely because it offered a nearly universal metric for evaluating current academic performance. But "them simplistic days" seem to be gone forever. The ancient grading system has been supplanted by a near infinity of variants—single, double, and triple checks; red, blue, and green stars; or lions, tigers, and giraffes. Even more perplexing for the aspiring quantifier are the extensive free-prose reports used by many school districts these days to evaluate children. Such reports have been known to cover 20 or more pages and to be as much, or more, oriented to unfamiliar turf, such as identity problems, socialization skills, and self-concept, as to the erstwhile, inviolable three Rs. However laudable efforts to find better, less competitive, less accusatory ways to assess a child's school performance are, they make it difficult, if not impossible, for researchers to use report card grades as a performance estimate in cross-district comparisons of program effects.

Similar problems are encountered with educational achievement tests. School districts have become more and more individualized about which tests they use, for whom, and when such tests are given. Thus, in designs that cut across school districts, the experimenter may need to combine apples and bananas if he/she hopes to use performance and achievement data.

Such reality factors will continue to restrict the researcher's use of criterion measures. In addition, the intrinsic complexity of many community evaluation studies poses challenges in selecting criteria. Complexity means several things—obvious ones such as (a) the multicomponent nature of many community programs and (b) the fact that they seek to affect multiple functions—sometimes differently for different people—and less than obvious ones, such as the fact that they usually take place in specific contexts of cost, morale,
attitudes, and expectancies. Though it helps a great deal to know that a program "works," ultimately it is important to disaggregate component effects, separating active from inert ingredients and identifying differential program effects for participants with different characteristics. Because such information is critical for improving programs, criteria should be selected with those realities in mind. Because programs have multivariate, differential, and changing outcomes, multiple outcome criteria, including behavioral ones wherever possible, should be used to evaluate them. Doing so not only accurately reflects a program's true complexity, but it also reduces the risk of putting all one's eggs in a single delicate criterion basket. Greater use should also be made of unobtrusive and/or unintended outcome measures; they are relatively easy to collect and potentially informative. Thus, an investigator might weep a bit less after finding that children in a school mental health program improved at p < .09, rather than p < .05, on a problem behavior inventory, were it also established that the principal had 30 fewer disciplinary referrals during the active program period. And, finally, in evaluation studies that require human judgments (by program personnel or target persons), pragmatic decisions to use brief, objective measures that are easy to understand and handle, and that are relevant to respondents' main interests and spheres of involvement, can help the cause immeasurably.

Problems of control. Problems of control, like those of criteria, are basic to most areas of psychological research. Their uniqueness, if any, to community program evaluation research derives from the realities of the community context. Control, in evaluation studies, seeks to insure that changes in behavior and/or performance are due to the effects of the intervention, rather than to potentially confounding variables that can produce similar findings. Being cast as a control is basically a dull, unenviable fate. It commits the "victim" to all the "dirty work" of research (e.g., time loss, intrusiveness, disruptiveness) without direct "pay off"—characteristically defined as needed services. Small wonder that relatively few places jump through hoops to be controls. Since many settings (schools, hospitals,
clinics, courts) are unwilling to serve as controls under any circumstance, the theoretical pool of control groups is limited before the researcher ever gets to it. Moreover, locating "any old" control group is hardly enough to assure a good outcome study. Ideally, experimentals and controls should be matched on a host of variables, which, left unattended, could confound the findings. Although the exact nature of these variables depends on the program's nature and purposes as well as characteristics of the subjects, they often include age, sex, intelligence, race, socioeconomic status, and the nature and extent of preprogram maladjustment. Both because many settings eschew the control role and because experimentals and controls must be matched on many control variables, the experimenter may have few degrees of freedom in searching for appropriate community control groups. At the very least, it can be a vexing, time-consuming challenge.

Not surprisingly, then, many program evaluation studies are done without control groups. Such studies depend on within-group pre-post comparisons, which, though sometimes helpful, entail certain risks. For one thing, they do not control for spontaneous or natural change over time; for another, if they rely heavily on human judgments of change—by subjects themselves or by program personnel—they are particularly susceptible to distortions such as response bias, Hawthorne, and reverse Hawthorne effects (Zdep & Irvine, 1970). Still another danger of studies that lack controls is the tendency of initially extreme test scores to regress to the mean on readministration. Such natural regression, on the surface, "looks like" improvement and can be confused with it. Finally, base rates for some behavioral criteria (e.g., delinquency rates and employment) that are appropriate in evaluating the effectiveness of certain programs change rapidly over short time periods (e.g., ages 12-14 for delinquency or 16-19 for employment). To use such criteria, without controls or anchoring base-rate data, could seriously distort a study's interpretation (Freeman & Sherwood, 1969).
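A minimal simulation, again not from the original article and with purely hypothetical numbers, shows how an uncontrolled pre-post design on subjects selected for extreme pretest scores can manufacture apparent "improvement" through regression to the mean alone.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical screening scenario with NO intervention and NO real change:
# each child has a stable true problem level; every testing adds independent error.
true_level = rng.normal(50, 10, size=1000)
pretest = true_level + rng.normal(0, 8, size=1000)
posttest = true_level + rng.normal(0, 8, size=1000)   # nothing happened in between

# Select the most extreme 15% on the pretest, as a screening program might.
selected = pretest >= np.percentile(pretest, 85)

print(f"Selected group pretest mean:  {pretest[selected].mean():.1f}")
print(f"Selected group posttest mean: {posttest[selected].mean():.1f}")
# The posttest mean drifts back toward the population mean of 50 even though no
# program was delivered: uncontrolled pre-post "improvement" by regression alone.
```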

Another community reality that undermines optimal control is the fact that experimental programs must often start at a certain time of year and run for X period of time if they are to be evaluated at all. School-based programs, for example, are bounded by the beginning and end of the school year. The program must start when it must start, even if an adequate control group is not available. Following all due effort, an approximately satisfactory control group may be located several months later. But, by then, children manifest different patterns of class adjustment problems (either because they are better known or due to normal seasonal variations), and class sociometric structures have changed from what they were 2 months earlier. It is difficult to know, when experimentals and controls are assessed at different times of the year, whether scores on key criterion measures such as the preceding ones mean the same thing for the two groups.

Given the realities of the community research scene, many adaptations to, and "solutions" of, the control dilemma have been tried—some voluntarily, some otherwise. The own-control procedure, in which a group goes through an inert preprogram wait period, is designed to bypass the thorny problems of finding matched controls. A variant of that approach is for a group to serve simultaneously as a matched control for an experimental group and as an own control for a specified time period, with the understanding that it will later participate in the regular program procedure. Although both of those variants can be useful, they may still present problems, for example, delay in providing services to individuals who need them or the dangers of systematically selecting as time controls subjects with less pressing service needs.

In some situations the most direct route to control is to subdivide a prospective pool of target subjects, within a given setting, into matched experimental and control subgroups. This approach is attractive both because it is convenient and because it may be easier to find a good overall match among within-setting subjects who share similar backgrounds, sociodemographic status, and histories. But there are also reasons why it may not work. Thus, if the program involves needed services, personnel from the setting may argue—sometimes vocally and insistently—that those with the greatest need must have first call on
services. Withholding a promising service from someone who needs it badly to satisfy the niceties of an abstract research design simply will not fly. Indeed, if pushed too hard, it can spell the program's demise—either before it starts or through later noncooperation or active resistance. It takes no genius to imagine the public relations confusion that can ensue. The obvious point is that community research has a real, vital ecological surround—more so than almost any other area of psychological research. That surround must be taken into serious account at all stages.

Another factor that limits the usefulness of within-setting control is intrasetting communication about a program. If, for example, the program intervention involves the use of cottage parents to teach verbal mediational techniques of self-control to residential delinquent adolescents, there is the danger that training procedures and program practices will spill over from experimentals to controls in a given institutional setting. Many community intervention programs involve indirect services such as consultation. If teachers are targets of a consultation program, it is unrealistic to expect that they will apply newly learned skills only to experimental children in their class and not to controls, or that they will not discuss useful new discoveries with other (nonexperimental) teachers in the building.

Control, as many investigators have learned, can be an elusive phenomenon; that is, "now you see it, now you don't." Seemingly pretty initial matches evaporate in the face of hazards beyond the experimenter's control. Thus an experimenter, who must start a program by a certain date, matches experimental and control groups on all major sociodemographic measures but not on preprogram adjustment or performance status. Because the latter data take longer to collect and score, the experimenter makes the perfectly reasonable assumption that random assignment of subjects will yield approximately matched groups on those dimensions. But it does not! Or, after careful matching is completed, attrition while the program is underway destroys the match. There are many reasons why such attrition occurs. People move; rater judges who furnished predata change jobs; administrative decisions are made to shift individuals out of
a program; needed data cannot be collected or prove to be invalid. And so it goes! If everyone who has been burned by such a problem submitted a brief description of his/her special headache, the resulting compilation of unanticipated, and "undeserved," false turns would fill many entire issues of this journal. After-the-fact loss of initial control for any of the preceding reasons is common, not rare, in community program outcome research. That is one reason why not all program evaluation studies reported—and even fewer of those done—ever "mess with" controls. Of those that do, a substantial number, present company included, have been victimized either by incomplete or far less than ideal initial control, or by unavoidable breakdown of control during the study.

Faced with such natural disasters, investigators who care about control have several compensatory options to pursue. One is to bail out, as best one can, statistically. Analysis of covariance is a generic procedure designed to minimize initial mismatches and to bring comparison groups back to approximately the same starting point. Another way to deal with initial mismatches is to hack and chisel at the subject groups in hopes of paring them down to approximately comparable samples. Among the dangers of this procedure is that the adjustments may have to be asymmetrical (either because the original disproportionality comes more from one group than from the other or because the N is more robust in one group than the other). If the reduction in N comes primarily from one group, it may distort the group's defining characteristics. Moreover, such a procedure often highlights an inherent conflict between the ideal of a tight match (which necessarily entails loss of N) and the robustness and representativeness of the samples retained. Since this conflict is real, it is sometimes resolved by tolerating noise in the match to preserve sufficiently large and representative Ns for the major substantive program evaluation analyses.
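For readers who want the statistical bail-out made concrete, the following is a minimal sketch of the kind of covariance adjustment mentioned above, assuming a simple two-group design with a preprogram score as the covariate; the simulated data, the variable names, and the use of the statsmodels library are illustrative assumptions rather than anything prescribed by the article.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Hypothetical mismatched groups: controls happen to start out better adjusted.
n = 120
group = np.repeat(["program", "control"], n // 2)
pre = np.concatenate([rng.normal(48, 10, n // 2),    # program group pretest
                      rng.normal(53, 10, n // 2)])   # control group pretest
true_effect = 4.0                                    # built-in program benefit
post = pre + rng.normal(0, 6, n) + np.where(group == "program", true_effect, 0.0)

data = pd.DataFrame({"group": group, "pre": pre, "post": post})

# Analysis of covariance: regress the postprogram score on group membership while
# adjusting for the preprogram score, to compensate (imperfectly) for the mismatch.
model = smf.ols("post ~ C(group, Treatment(reference='control')) + pre", data=data).fit()
print(model.params)
print(model.pvalues)
```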

An interesting but more subtle question that the researcher sometimes faces is, When is a control not a control? Thus, groups can be well matched statistically but not "psychologically." Take as an example a program in which crisis interventionists are trained to use special abreactive techniques in which they have interest but no special investment. The experimental question is whether the use of such techniques improves intervention outcomes with people experiencing current life crises. The groups seen by the specially trained and "regular" workers are well matched demographically and in terms of the nature and seriousness of the crises that they have experienced. The study's criteria include the clinician's preratings and postratings of a series of relevant patient behaviors. In such a situation, experimental and control interventionists may have different cognitive sets about the study, with experimentals thinking that "they are evaluating the effectiveness of this new program" and controls believing that "they are evaluating my effectiveness as a clinician." If such differential program views exist, experimentals' postprogram ratings are more likely either to be objective or to reflect personal (pro or con) views of the program, whereas controls, who see themselves as the focus of the study, are more likely to provide positive change estimates for clients. Should that happen, genuine program effects are obscured or lost.

A similar example can be cited in evaluating the effectiveness of school-based intervention programs. Teachers of experimental children in such programs are more likely than those of controls to see their behavioral evaluations of children as program related. Hence, their judgments may be influenced by their views of, and attitudes toward, the program. By contrast, teachers of control children, lacking a program metric, are more likely to see the rating task in the context of how good a job I have personally done with Johnny or Mary this year. If so, there is a pull for them to give more positive end-of-year ratings (Cowen et al., 1974).

Standard statistical controls may also not suffice in situations in which a program's main content and activities happen, incidentally, to involve a major structural change in the everyday lives of the target subjects. Assume, for example, that college student volunteers are trained to work with chronic hospitalized patients using rational-emotive techniques. Is it sufficient in evaluating the effectiveness of the intervention to have a matched experimental and control group? Probably not! Such a
design might confound the program's ostensibly active ingredient (i.e., the therapeutic approach) with the fact that the procedure, incidentally, involved establishing a meaningful interpersonal relationship with people whose lives ordinarily lacked such relationships. An ideal third group that would strengthen the study's interpretive base is one that could control for the personal relationship (e.g., with games and recreational activities) in the absence of an intentional, therapy-system thrust. Similar examples can be identified for active therapeutic programs with adolescents in correctional settings, children in institutions for the retarded, or geriatric patients.

The unusual complexity of certain community settings (e.g., hospitals and schools) plus the fact that many are veritable laboratories for exploring many, ever-changing, program variations underscores the fact that mere demographic-statistical comparability of experimentals and controls does not automatically solve the control problem. From an experimenter's standpoint, evaluation of a specific program would be clearest if there were no other special programs in either the experimental or control settings. Indeed, experimenters' special blinders may impel them to see the world that way, even though that view does not correspond, ecologically, to reality. More often than not, settings such as schools and hospitals house a variety of programs—formal and informal, short- and long-lived. Some of these programs may address the same objectives and behaviors as the experimental program in question (Freeman & Sherwood, 1969). A behavior modification program for hospital patients may take place alongside drug-therapy and patient ward-governance programs. A school mental health intervention may co-occur with Glasser circle and Distar programs (Cowen et al., 1974). The intermixing of such programs not only makes it difficult to evaluate their separate contributions but also often means that an ostensibly pure experimental program is in fact that program plus several overlapping services or programs in one setting, compared to another (so-called control) setting, which happens not to have that particular program but does have three or four other programs addressed to similar
behaviors in comparable target subjects. Sometimes, in fact, an administrative decision is made to assign the special program to one setting because (compared to other similar ones) it is deficient in the type of services that the program provides. Conversely, control settings may be assigned other similar programs as part of an (understandable) administrative philosophy of sharing the wealth. Practical problems of control, in such situations, are magnified by the facts that some of the overlapping programs, either in experimental or control settings or both, are likely to be short-lived or to change in the process and that new programs may be introduced while the experiment is in progress. Although each of the foregoing possibilities is regrettable experimentally, they are part of the community's reality. Problems of proper control, in community research, are diabolically complex and create serious, persistent stumbling blocks to sound program evaluation research.

Overview and Summary

Communities are many things. One thing they are not is an ideal laboratory for antiseptic psychological studies. Their extraordinary complexity, omnipresent flux, action-service orientation, and susceptibility to day-to-day pressures present real and formidable barriers to "Mr. Clean" program evaluation studies. These factors place major constraints on the design of studies, the types of criteria that can be used, and the rigor and sophistication of the control that can be exercised. Although some of those problems can be reduced through judicious planning, others, quite beyond the experimenter's control, cannot. This is one reason why theory, logic, and the actual development and implementation of new community programs have outpaced the field's supporting research base. The tugs and pulls of this situation are clear. On the one side is the obvious need to pose important, socially significant questions and to understand the impact and value of innovative practices designed to overcome long-standing, refractory problems in mental health. On the other are our training and bloodlines as experimenters and our
understandings of past accepted canons for accreting new knowledge. These opposing tensions are as apparent in community research as in any subdomain of psychology today.

The intent of this article is not to discourage trying harder. Such effort is sorely needed; it can have great payoff value. Much can be done to strengthen community program evaluation technology and to design studies that reduce sources of confound or error. Weaknesses in specific measures or in classes of criteria typically used in community program outcome research dictate that greater emphasis be placed on converging sources of evidence. But we must still expect that community realities will remain to militate against ideal research studies. The vulnerability of findings from any single community evaluation study points to the importance both of replication and of tolerance for a slow accretive process, in which small pieces in a puzzle gradually cumulate toward weight-of-evidence conclusions about major new programming approaches. Although such a process is not intrinsically inimical to the way of science, it may be more caricatured in community research than in other fields. The compelling logic of the community approach, the significance of the problems it addresses, and the excitement and clinical promise of some of its early innovative programmatic efforts have been sufficient to carry the field's infancy and early childhood. The future, however, will stand or fall on the solidity of its empirical footing. Social significance cannot, in that process, be sacrificed at the altar of laboratory precision. Hence, we must expect that successive approximations—the gradual putting together of sometimes chipped or scarred building blocks—will be the way of community program evaluation research in the coming decades.

For the reader who seeks wisdom and sophistication beyond the frailties of the present account, the following additional sources are suggested: Schulberg, Sheldon, and Baker, 1969; Bloom, 1972; Roen, 1971; Glass, 1976; Hammer, Landsberg, and Neigher, 1976; Fairweather and Tornatzky, 1977; Neigher, Hammer, and Landsberg, 1977; and Guttentag and Saar, 1978.

References

Bloom, B. L. Mental health program evaluation. In S. E. Golann & C. Eisdorfer (Eds.), Handbook of community mental health. New York: Appleton-Century-Crofts, 1972.
Bloom, B. L. Evaluating achievable objectives for primary prevention. In D. C. Klein & S. E. Goldston (Eds.), Primary prevention: An idea whose time has come (Department of Health, Education, and Welfare Publication No. (ADM) 77-447). Washington, D.C.: U.S. Government Printing Office, 1977.
Brunswik, E. Systematic and representative design of psychological experiments. Berkeley, Calif.: University of California Press, 1947.
Campbell, D. T. Factors relevant to the validity of experiments in social settings. In H. C. Schulberg, A. Sheldon, & F. Baker (Eds.), Program evaluation in the mental health fields. New York: Behavioral Publications, 1969.
Cowen, E. L. Social and community interventions. Annual Review of Psychology, 1973, 24, 423-472.
Cowen, E. L. Baby-steps toward primary prevention. American Journal of Community Psychology, 1977, 5, 1-22.
Cowen, E. L., Lorion, R. P., & Dorr, D. Research in the community cauldron: A case report. Canadian Psychologist, 1974, 15, 313-325.
Fairweather, G. W., & Tornatzky, L. G. Experimental methods for social policy research. New York: Pergamon Press, 1977.
Freeman, H. E., & Sherwood, C. C. Research in large scale intervention programs. In H. C. Schulberg, A. Sheldon, & F. Baker (Eds.), Program evaluation in the mental health fields. New York: Behavioral Publications, 1969.
Glass, G. V. (Ed.). Evaluation studies review annual (Vol. 1). Beverly Hills, Calif.: Sage Publications, 1976.
Glidewell, J. C., Gildea, M. C.-L., & Kaufman, M. K. The preventive and therapeutic effects of two school mental health programs. American Journal of Community Psychology, 1973, 1, 295-329.
Guttentag, M., & Saar, S. (Eds.). Evaluation studies review annual (Vol. 2). Beverly Hills, Calif.: Sage Publications, 1978.
Hammer, R. J., Landsberg, G., & Neigher, W. (Eds.). Program evaluation in community mental health centers. New York: D and O Press, 1976.
Hammond, K. R. Representative vs. systematic design in clinical psychology. Psychological Bulletin, 1954, 51, 150-159.
Heller, K., & Monahan, J. Psychology and community change. Homewood, Ill.: Dorsey Press, 1977.
Hereford, C. F. Changing parental attitudes through group discussion. Austin: University of Texas Press, 1963.
Kelly, J. G. The quest for valid preventive interventions. In G. Rosenblum (Ed.), Issues in community psychology and preventive mental health. New York: Behavioral Publications, 1971.

Neigher, W., Hammer, R. J., & Landsberg, G. (Eds.). Emerging developments in mental health program evaluation. New York: Argold Press, 1977.
Roen, S. R. Evaluative research and community mental health. In A. E. Bergin & S. L. Garfield (Eds.), Handbook of psychotherapy and behavior change: An empirical analysis. New York: Wiley, 1971.
Schulberg, H. C., Sheldon, A., & Baker, F. (Eds.).
Program evaluation in the mental health fields. New York: Behavioral Publications, 1969.
Zax, M., & Klein, A. Measurement of personality and behavior changes following psychotherapy. Psychological Bulletin, 1960, 57, 435-448.
Zdep, S. M., & Irvine, S. H. Reverse Hawthorne effect in educational evaluation. Journal of School Psychology, 1970, 8, 85-95.

Received December 31, 1977
