JOURNAL OF COMMUNICATION DISORDERS 8 (1975), 237-247

COMMUNICATION DISORDERS: A POWER ANALYTIC ASSESSMENT OF RECENT RESEARCH

ROBERT M. KROLL
Director of Speech Pathology, Clarke Institute of Psychiatry, 250 College Street, Toronto 26, Ontario, Canada

LAWRENCE J. CHASE
Assistant Professor of Communication, The Cleveland State University, Cleveland, Ohio

This study assessed the relative statistical power of contemporary research in communication disorders. Results of the analysis, based upon an evaluation of two major journals, revealed overall mean power figures of 0.16, 0.44, and 0.73 for small, medium, and large effect sizes, respectively. Interdisciplinary comparisons indicated that low statistical power is not unique to research in communication disorders, but is apparent in other behavioral science areas as well. Several alternatives are offered to the researcher who wishes to ensure sufficient power for his investigation on an a priori basis. The implications of this study are discussed in reference to the experimenter/clinician model.

Introduction

In recent years, the amount of empirical research conducted in the field of speech pathology and audiology has increased substantially; associated with this trend has been the widespread use of statistical procedures to facilitate data analysis. Concomitant methodological advances in clinical practices have also occurred. Yet, "there seems to be a persistent and devastating attitude shared by many speech pathologists that research activity and clinical practice are dichotomous" (Hamre, 1972, p. 542). Indeed, many notables in the field of speech pathology and audiology have addressed themselves to this issue (Darley, 1969; O'Neill, 1970; Goldstein, 1972; Hamre, 1972; Schultz, Roberts, and Yairi, 1972). Generally, these authors have stressed the need for integration of research and clinical functions so that the common goal of generalizability may be better served. That is, increased communication between the experimenter and clinician should yield valid and more extensive instruments and techniques, thereby expanding the combined therapeutic arsenal. While the integrative model represents a meaningful step forward, it should also be noted that each component requires periodic evaluation and assessment. Thus, the clinician must constantly examine the efficacy of the techniques and materials utilized in the therapeutic situation, while the experimentalist should seek to improve and validate experimental operations. Only through comprehensive and rigorous procedural appraisal can the experimenter/clinician partnership be maintained efficiently.

A review of the literature indicates that several investigators have examined the clinical applicability and validity of a variety of tests and techniques (see, for example, Griffith and Miner, 1969; Eisenstadt, 1972; Starkweather, 1971; Baird, 1972; Beasley and Manning, 1973). However, methodological assessments of research in speech pathology and audiology are virtually nonexistent. We contend that the increase in empirical investigation in our field warrants such an evaluation. Increases in experimentation (and the development of data-analytic sophistication) in the other social sciences have been accompanied by these kinds of critiques, thus strengthening the case for an analysis of research in communication disorders (see, for example, psychology: Cohen, 1962; Rosenthal and Gaito, 1963; Tversky and Kahneman, 1971; sociology: Duggan and Dean, 1968; education: Brewer, 1972; communication: Katzer and Sodt, 1973; Chase and Tucker, 1975). The purpose of the current investigation, therefore, was to conduct a methodological assessment of research in speech pathology and audiology. Specifically, the statistical power of our research was examined. Prior to presenting the methodology and results of this study, an explication of statistical power analysis is required.

Statistical Power Analysis

According to Cohen (1969, p. 4), "the power of a statistical test of a null hypothesis is the probability that it will lead to a rejection of the null hypothesis, i.e., the probability that it will result in the conclusion that the phenomenon exists." Statistical power is a function of three other parameters: (1) significance level (alpha; Type I error rate); (2) sample size; and (3) effect size. The significance level is the predetermined probability of committing a Type I error (the false rejection of a "valid" null hypothesis), and is typically set at either 0.05 or 0.01 in the behavioral sciences. Sample size refers to the number of subjects (N) utilized in an experiment, and effect size is the degree of departure from the null hypothesis the experimenter expects to detect in the population (or, equivalently, the proportion of variance in the dependent variable, e.g., instances of stuttering, that is accounted for by the independent variable, e.g., treatment type).

The relationship of power, alpha, sample size, and experimental effect is more complex than the definitions offered above might indicate, however. The power of a test is partially determined by the alpha level, in that a "false" null must fall within the critical region to be rejected. As the upper bound of the critical region (alpha level) decreases in size, the probability of accepting a false null is increased. For example, if an experimenter conducts a one-tailed t-test with 30 subjects per cell, and sets his alpha level at 0.01, he can expect to reject his null hypothesis only 32 times over 100 trials (power = 0.32, assuming a medium effect size). The corresponding beta error for this test would be 0.68 (1 - power = beta); in essence, this experimenter would rather commit a beta error instead of an alpha error by a 68 to 1 ratio! Had this investigator set his significance level at 0.05, the power of the test would have been 0.61 (Cohen [1965] recommends power = 0.80).

Another method of increasing statistical power, and consequently lowering the risk of committing a beta error, is to increase the number of subjects. According to Cohen (1962, p. 145), "other things equal, power is a monotonic function of sample size." Thus, if the above experimenter wanted to achieve power = 0.80 (and beta = 0.20), he would increase the number of subjects per cell to 84. The required sample size for the same experiment in which the 0.05 level of significance was employed would have been only 50. However, most behavioral scientists do not have an unlimited subject pool and cannot always obtain the desired number of subjects. This is especially true when specific subject types (e.g., stutterers, hearing-impaired persons) are required.

An alternative to increasing the sample size is to seek a larger experimental effect. In order to illustrate this relationship, the concept of experimental effect must be further described. Cohen (1969) explicates effect size as follows:

When the null hypothesis is false, it is false to some specific degree, i.e., the effect size (ES) is some specific nonzero value in the population. The larger this value, the greater the degree to which the phenomenon under study is manifested (p. 10).
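These relationships are easy to verify numerically. The following is a minimal sketch, assuming Python's statsmodels package (a modern tool, not one used by the original authors), that reproduces the figures quoted above for the one-tailed two-sample t-test with a medium effect; small discrepancies from Cohen's (1969) tables are expected because those tables rest on approximations.

```python
# A minimal sketch, assuming statsmodels is available; reproduces the
# hypothetical experimenter's figures for a one-tailed two-sample t-test
# with a medium effect size (Cohen's d = 0.50).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power with 30 subjects per cell at alpha = 0.01 vs. alpha = 0.05
for alpha in (0.01, 0.05):
    power = analysis.power(effect_size=0.5, nobs1=30, alpha=alpha,
                           alternative='larger')
    print(f"alpha = {alpha}: power = {power:.2f}")   # close to the 0.32 and 0.61 quoted above

# Subjects per cell required for power = 0.80 at each alpha
for alpha in (0.01, 0.05):
    n = analysis.solve_power(effect_size=0.5, power=0.80, alpha=alpha,
                             alternative='larger')
    print(f"alpha = {alpha}: n per cell = {n:.0f}")  # close to the 84 and 50 quoted above
```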

When the effect size is large, the power of the statistical test is increased, other things equal. The relation between effect size and sample size, as presented in Table 1, may provide further clarification.

TABLE 1. Sample Size Required for Power = 0.80; t-Test, One-Tailed, n1 = n2, α = 0.01

Effect size                 Required sample size
Small effect (0.20 s)       n = 500
Medium effect (0.50 s)      n = 84
Large effect (0.80 s)       n = 33

Table 1 illustrates the inverse relationship between effect size and sample size with regard to statistical power; the larger the experimental effect, the smaller the required sample size. The difficulty of ascertaining the effect size a priori constitutes the basic indeterminacy of statistical power analysis. In contrast to sample size and significance level, the effect size cannot be accurately determined until the analysis is complete. Therefore, an investigator must estimate experimental effect. Cohen (1965) has provided three methods:

(1) He can estimate the effect size, using his knowledge of previous studies in his field, as "small," "medium," or "large."
(2) He can simply take the criterion for the medium effect size as convention, "or, better still . . ."
(3) ". . . he can select a series of reasonable effect sizes and work with them as a conditional manifold" (p. 97).

Once the experimental effect has been approximated, the investigator may then estimate the statistical power for the hypothesis test. Cohen (1969) has provided tables giving statistical power values for a wide range of research designs, and has categorized them according to the type of statistic employed (e.g., F, t, χ², r) and level of experimental effect. If the estimated statistical power is found to be inadequate, the investigator may wish to work with a larger effect size. Although this alternative may require redesigning the experiment, the advantages derived may well be worth the effort. Cohen (1973) explicated the necessity for larger effect sizes in his critique of Brewer's (1972) power analysis:

Obviously, the solution to the problem of very small effect sizes in behavioral science is to emulate the older sciences: strive toward developing the insights which lead to research procedures and instruments which make effects measurably huge enough to be detected by experiments of reasonable size. Nuclear physicists have spent entire careers and millions of dollars to make subatomic events sufficiently palpable to be studied. The educational researcher pursuing a theoretical problem must first become aware of the sizes of effects; then, if the one he is pursuing is very small (usually the case at the beginning), he must apply himself toward making it bigger, rather than passively wanting to detect it "regardless of how small the effect is," and heedlessly "strive for significance." (pp. 228-229)

Had the researcher in the above example redesigned his study so that the expected effect size was large, the resultant power of the one-tailed t-test with α = 0.05 and n1 = n2 = 30 would have been 0.92.
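Cohen's third method, working with a series of reasonable effect sizes as a conditional manifold, can also be sketched in a few lines. The snippet below (again assuming statsmodels, which is not part of the original study) tabulates power at each of the conventional t-test effect sizes for the design just described, recovering the 0.92 figure for a large effect.

```python
# A minimal sketch of a "conditional manifold": power at each of Cohen's
# conventional effect sizes for a fixed design (assumes statsmodels);
# one-tailed two-sample t-test, n1 = n2 = 30, alpha = 0.05.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for label, d in (("small", 0.20), ("medium", 0.50), ("large", 0.80)):
    power = analysis.power(effect_size=d, nobs1=30, alpha=0.05,
                           alternative='larger')
    print(f"{label} effect (d = {d}): power = {power:.2f}")
# Approximately 0.19, 0.61, and 0.92, in line with Cohen's (1969) tables.
```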

The Importance of Statistical Power

Statistical power should be of major concern to behavioral scientists for three primary reasons. First, and most importantly, inadequate statistical power may prevent the researcher from attaining statistical significance, thus increasing the probability of a Type II (beta) error. Second, and associated with this risk, is the likelihood of presenting misleading results in a research report. When a research hypothesis has been rejected, the factors contributing to the obtained outcome should be fully delineated. This is especially meaningful when the statistical power of the experiment was inadequate. Certainly further research should be advised when the a priori odds against rejecting the null were 68 to 1 (as was the original case with our hypothetical investigator). A third reason for considering statistical power concerns the utilization of factorial designs. One may argue that a significant main effect is not affected by low statistical power. However, power is calculated for interactions as well as for main effects, and a significant interaction renders the main effects essentially uninterpretable (see, for example, Campbell and Stanley, 1963; McNemar, 1969; Edwards, 1972). Thus, inadequate statistical power for tests of interactions should be a source of concern for experimenters using factorial designs.

There are other reasons for taking statistical power into account when planning an experiment. However, a complete discussion here is inappropriate. The reader is urged to consult Cohen (1965), Overall (1969), Katzer and Sodt (1973), and Chase and Tucker (1975).

Method

Characteristics of the Sample

The present investigation assessed the statistical power of recently published research in the field of communicative disorders. The two journals selected for this analysis were the Journal of Communication Disorders and the Journal of Speech and Hearing Research. These journals provide a representative sample of contemporary research in speech pathology and audiology. Every 1973 issue of both journals, and those 1974 issues that had been published prior to the inauguration of this study, were included. All articles incorporating significance tests were examined. The average statistical power for the three effect sizes was computed for each study using the tables provided by Cohen (1969); appropriate measures of central tendency and dispersion were also calculated.

Effect Size Indices

In order to determine small, medium, and large effect size estimates, Cohen's (1969) measures of experimental effect were utilized. Effect size indices were provided for t-tests, tests for normal proportions, tests for normal correlations, tests for significance of correlations, sign tests, chi-square tests, and F tests. These measures, expressed in nonmetric standard score units, are presented in Table 2.

TABLE 2. Effect Size Indices for Statistical Tests (Standard Score Values)

Test                                  Small   Medium   Large
t (two means are equal)               0.20    0.50     0.80
Normal (two proportions are equal)    0.20    0.50     0.80
Normal (two r's are equal)            0.10    0.30     0.50
r (r = 0)                             0.10    0.30     0.50
Sign test                             0.05    0.15     0.25
Chi-square (proportions)              0.05    0.10     0.20
Chi-square (contingency)              0.05    0.10     0.20
F test                                0.10    0.25     0.40

Standard and Limiting Conditions

For the purpose of consistency within this analysis, certain criteria were standardized. The conditions that were held constant throughout the current assessment included: (1) level of significance and (2) orthogonality of design. Thus, the Type I error rate was preset at 0.05 for all calculations, and the mean number of subjects was used regardless of whether cell sizes were equal. It should be further noted that some tests of significance were omitted from the current analysis for the following reasons: (1) on several occasions, tests for which power determinations could not be made were employed, e.g., the Kruskal-Wallis one-way ANOVA and the Mann-Whitney U test; (2) improper applications of parametric statistics to nominal/ordinal data prevented the computation of statistical power, as it would have had no practical value; and (3) inconsistent reporting of pertinent data, such as the deletion of ANOVA summary tables, further complicated the determination of statistical power. In the absence of summary tables, attempts were made to retrieve the relevant information from the text of the article. However, such information was often not available, or was uninterpretable, thereby precluding further analysis.
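The per-article computation can be illustrated with a short sketch. The snippet below is a hypothetical reconstruction, not the authors' actual procedure; it assumes the statsmodels package in place of Cohen's printed tables, and the article's tests are invented for illustration. It averages the power of one article's tests at each of the three effect size conventions from Table 2.

```python
# Hypothetical sketch of the survey procedure for a single article:
# estimate each test's power at Cohen's small/medium/large conventions,
# then average across the article's tests (tests here are invented).
from statsmodels.stats.power import TTestIndPower, FTestAnovaPower

article_tests = [
    ("t", dict(nobs1=20)),             # two-group t test, 20 per group
    ("t", dict(nobs1=12)),             # a second, smaller t test
    ("F", dict(nobs=60, k_groups=3)),  # one-way ANOVA, N = 60, 3 groups
]
# Cohen's (1969) conventions for each index (cf. Table 2)
conventions = {"t": {"small": 0.20, "medium": 0.50, "large": 0.80},
               "F": {"small": 0.10, "medium": 0.25, "large": 0.40}}
solvers = {"t": TTestIndPower(), "F": FTestAnovaPower()}

for size in ("small", "medium", "large"):
    powers = [solvers[kind].power(effect_size=conventions[kind][size],
                                  alpha=0.05, **kwargs)
              for kind, kwargs in article_tests]
    print(f"{size}: mean power = {sum(powers) / len(powers):.2f}")
```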

Results and Discussion

Overall Survey Data

Both the Journal of Communication Disorders and the Journal of Speech and Hearing Research were composed of four issues in 1973. At the inception of this project, one issue of the Journal of Communication Disorders and two issues of the Journal of Speech and Hearing Research had been published in 1974. From these issues, 62 articles were power analytically examined, resulting in a consideration of 1,037 major statistical tests. The median number of tests per article was eight. Table 3 illustrates the distribution of these tests with regard to estimated statistical power.

TABLE 3. Frequency and Cumulative Percentage Distributions of the Power of the 62 Articles to Detect Small, Medium, and Large Population Effects (distributions tabulated over power bands from > 0.99 down to < 0.09; summary statistics shown)

             Small effect   Medium effect   Large effect
n            62             62              62
Mean         0.16           0.44            0.73
Median       0.12           0.38            0.71
Mode         0.13           0.30            0.99
Q1           0.08           0.29            0.57
Q3           0.19           0.55            0.86
% > 0.50     4              29              87
% > 0.80     2              15              32

Small effects. Only one of the articles surveyed attained the recommended statistical power for a small effect size. Moreover, only 3 of the 62 articles achieved power equal to or greater than 0.50, given a small effect size. The power estimates reveal that the investigators contributing to these journals preferred to commit a Type II or beta error over a Type I or alpha error by an average ratio of 17 to 1, given a small effect criterion.

Medium effects. The average power, assuming a medium experimental effect, was 0.44. Almost 30% of the sample (18 of 62 articles) attained an average power equal to or greater than 0.50; 9 studies exceeded the recommended statistical power of 0.80. If a medium effect size is posited, the ratio of Type II to Type I errors is approximately 11 to 1.

Large effects. The situation improves markedly when large effects are examined. The average power for a large effect size was 0.73. Only eight of the surveyed articles attained average power values of less than 0.50. Thus, 54 of the 62 investigators gave themselves at least an even chance of detecting an experimental effect (given a large effect size criterion). Moreover, almost one third of the articles achieved power values equal to or greater than 0.80. Nine studies attained average power estimates of 0.95 or greater. These investigators had an approximately equal chance of committing an alpha error and a beta error (β = [1 − 0.95] = 0.05; α = 0.05). Only one of the studies sampled achieved this one-to-one ratio for a small effect size, and three studies attained power values greater than or equal to 0.95 when a medium effect size was posited.
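The reported error ratios follow directly from the mean power figures and the fixed 0.05 significance level; a worked check using the survey's own numbers:

```latex
\frac{\beta}{\alpha} = \frac{1 - \text{power}}{\alpha}:
\qquad \frac{1 - 0.16}{0.05} = 16.8 \approx 17, \qquad
\frac{1 - 0.44}{0.05} = 11.2 \approx 11 .
```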


Interdisciplinary Comparisons

Table 4 presents the data from previous power analyses conducted in other disciplines, along with the results of the present power analytic evaluation of research in communication disorders. It is interesting to note the similarity of results in each of these power analyses. The average power figures are similar for small, medium, and large effect sizes among those disciplines that have been evaluated. It thus appears that low statistical power, and the consequent high probability of Type II error, is not at all unique to research in the area of communication disorders.

TABLE 4. Interdisciplinary Comparisons: Power in Psychology, Education, Communication, and Communication Disorders

Discipline (Investigator)                          Articles/tests   Small effect   Medium effect   Large effect
Psychology (Cohen, 1962)                           70/2088          0.18           0.48            0.83
Education (Brewer, 1972)                           47/373           0.14           0.58            0.78
Communication* (Chase and Tucker, 1975)            46/1298          0.18           0.52            0.79
Communication Disorders (Kroll and Chase, 1975)    62/1037          0.16           0.44            0.73

*Katzer and Sodt's (1973) analysis included only one communication journal, while Chase and Tucker (1975) examined nine communication journals. Thus, the latter analysis was selected on the basis of representativeness.

Implications for Future Research

Type II error. Inadequate statistical power should alert not only speech pathologists and audiologists, but other behavioral scientists as well, to several research problems. The major problem associated with low statistical power concerns the high probability of beta errors. True, those studies that reported significant results (excluding factorial ANOVA designs) were relatively unaffected. However, it should be noted that only the published research in communication disorders was examined. According to Cohen (1962):

It seems obvious that investigators are less likely to submit for publication unsuccessful than successful research, to say nothing of a similar editorial bias in accepting research for publication. . . . if anything, published studies are more powerful than those which do not reach publication, certainly not less powerful. Therefore, the going antecedent probability of success of current research is much lower than one would like to see it. . . . (pp. 152-153)

Type I error. Moreover, these research practices also serve to inflate the Type I error rate of published studies. Cohen's (1969) characterization of a psychological experimenter applies equally to researchers in communication disorders:

Consider our current practice. If Dr. Doe is tyrannized by the five per cent α level, accepts it passively, does the experiment, and fails to reject the null hypothesis (of which there is a 0.40 chance), there is a strong possibility that his report will never be published anyway. The reluctance of investigators to offer negative results for publication is exceeded only by that of editors in regard to publishing them. . . . An incidental consequence of not publishing negative results is that the actual rate of spuriously "significant" reports in published research is higher than conventional α levels. (p. 100, italics added)
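Cohen's point lends itself to a small simulation. The sketch below is an illustration invented for this discussion, not part of the original study; it assumes NumPy and SciPy. It mixes underpowered true effects with true nulls, "publishes" only the significant results, and shows that the spurious share of the published record exceeds the nominal 5% alpha level.

```python
# Hypothetical simulation of publication bias: many two-group experiments,
# half with no true effect (d = 0) and half with a small true effect
# (d = 0.2, badly underpowered at n = 30); only p < .05 gets "published".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials_per_kind = 0.05, 30, 2000

published = {0.0: 0, 0.2: 0}
for true_d in published:
    for _ in range(trials_per_kind):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(true_d, 1.0, n)
        if stats.ttest_ind(a, b).pvalue < alpha:
            published[true_d] += 1

spurious_share = published[0.0] / sum(published.values())
print(f"share of published findings that are spurious: {spurious_share:.0%}")
# Typically well above the nominal 5% alpha level.
```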

If this is the case in communication disorders, then two important conclusions may be drawn: (1) many of our potentially heuristic research lines have been ignored due to insignificant results (which may have occurred due to inadequate power), and (2) some relatively unimportant research (in which the null hypothesis was falsely rejected) has stimulated further efforts, and has consequently diverted our attention from more meaningful analyses.

Tests for interactions. Finally, inadequate statistical power may further contribute to misleading generalizations when factorial designs are employed. For example, consider the two-way analysis of variance, a commonly used statistical procedure. If a matrix consisting of four cells is constructed, with the rows representing two types of stutterers and the columns representing two methods of treatment, three outcomes (or permutations thereof) are possible: (1) one group of stutterers may perform better regardless of the treatment employed (main effect of A); (2) treatment 1 may be superior across both groups of subjects (main effect of B); or (3) treatment 1 may yield better results with one group, while the second method is shown to be superior with group 2 (interaction effect). When an interaction exists, as in case (3), the significant main effects cannot be interpreted as such. However, low statistical power may prevent tests for interactions from attaining significance. Therefore, inadequate power for tests of interactions should alert the reader immediately. This factor becomes even more salient when Cohen's (1969, p. 368) admonition is recalled: "relatively poor power is the inevitable fate for tests of interactions." Fortunately, three methods of handling this difficulty are available to the investigator (Cohen, 1969):

The most obvious solution, increasing n per relevant cell, may be possible in some instances, but in many others would result in demands for experiments far beyond the resources of the experimenter. Another possibility worth considering is the reduction of the number of levels of one or more factors, thereby reducing the number of cells and increasing the n per relevant cell. Finally, in an effort to avoid doing F tests on interactions which are of low power and hence yield ambiguous negative results, one should consider setting less stringent significance criteria for interaction tests, e.g., α = 0.10 instead of 0.05. It may well be worth a larger Type I error to bring Type II errors down to a tolerable level. (pp. 368-369)
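The scale of the problem, and the effect of Cohen's relaxed-alpha remedy, can be sketched directly from the noncentral F distribution. The function below is a minimal illustration assuming SciPy and adopting the common convention that the noncentrality parameter equals f²N for Cohen's effect size index f; it is not a reconstruction of Cohen's tables.

```python
# Minimal sketch: power of the interaction F test in a 2 x 2 ANOVA,
# from the noncentral F distribution (assumes noncentrality = f^2 * N,
# with f being Cohen's effect size index for F tests).
from scipy import stats

def interaction_power(f_effect, n_per_cell, alpha=0.05):
    cells = 4                                  # 2 x 2 design
    df_num = 1                                 # (2-1) * (2-1) interaction df
    df_den = cells * (n_per_cell - 1)
    ncp = f_effect**2 * cells * n_per_cell     # noncentrality parameter
    f_crit = stats.f.ppf(1 - alpha, df_num, df_den)
    return stats.ncf.sf(f_crit, df_num, df_den, ncp)

# A medium interaction effect (f = 0.25) with 15 subjects per cell:
print(round(interaction_power(0.25, 15), 2))              # roughly 0.5
# Cohen's remedy: relax alpha to 0.10 to trade Type I for Type II error.
print(round(interaction_power(0.25, 15, alpha=0.10), 2))  # noticeably higher
```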

Conclusion

In conclusion, for the reasons of Type I and Type II error risks, interpretation of tests for interactions, and related considerations, statistical power warrants the attention of speech pathologists and audiologists. Specifically, statistical power should be of major concern in both the planning and analysis phases of an empirical study. Increased awareness of the consequences of inadequate power will serve to improve the type and quality of experimental, clinical, and/or theoretical implications that may be derived from an empirical investigation. This, in turn, will aid in the establishment and maintenance of an effective, complementary relationship between the experimentalist and clinician. Ultimately, the needs of the speech- and hearing-impaired population will be better served.

Appreciation is extended to Professor Raymond K. Tucker of Bowling Green State University for his invaluable contribution to our methodological education. Professor Norbert H. Mills of the University of Toledo provided insightful criticism of an earlier version of this manuscript.

References

Baird, R. On the role of chance in imitation-comprehension-production test results. J. Verbal Learning Verbal Behav., 1972, 11, 474-477.
Beasley, D.S., Manning, J.I. Experimenter bias and speech pathologists' evaluation of children's language skills. J. Commun. Dis., 1973, 6, 93-101.
Brewer, J. On the power of statistical tests in the American Educational Research Journal. Am. Educ. Res. J., 1972, 9, 391-401.
Campbell, D.T., Stanley, J.C. Experimental and quasi-experimental designs for research. Chicago: Rand McNally and Co., 1963.
Chase, L.J., Tucker, R.K. A power-analytic examination of contemporary communication research. Speech Monogr., 1975, 42, 29-41.
Cohen, J. The statistical power of abnormal-social psychological research: A review. J. Abnorm. Soc. Psychol., 1962, 65, 145-153.
Cohen, J. Some statistical issues in psychological research. In B.B. Wolman (Ed.), Handbook of clinical psychology. New York: McGraw-Hill, 1965.
Cohen, J. Statistical power analysis for the behavioral sciences. New York: Academic Press, 1969.
Cohen, J. Statistical power analysis and research results. Am. Educ. Res. J., 1973, 10, 225-229.
Darley, F. Clinical training for full-time clinical service: A neglected obligation. ASHA, 1969, 11, 143-148.
Duggan, T., Dean, C. Common misinterpretations of significance levels in sociological journals. Am. Sociol., 1968, 3, 45-46.
Edwards, A. Experimental design in psychological research (4th ed.). New York: Holt, Rinehart and Winston, 1972.
Eisenstadt, A. Weakness in clinical procedures: a parental evaluation. ASHA, 1972, 14, 7-9.
Goldstein, R. Presidential address, 1971 National Convention. ASHA, 1972, 14, 3-6.
Griffith, J., Miner, L. LCI reliability and size of language sample. J. Commun. Dis., 1969, 2, 264-267.
Hamre, C. Research and clinical practice: a unifying model. ASHA, 1972, 14, 542-545.
Katzer, J., Sodt, J. An analysis of the use of statistical testing in communication research. J. Commun., 1973, 23, 251-265.
McNemar, Q. Psychological statistics (4th ed.). New York: John Wiley & Sons, 1969.
O'Neill, J. Commitment to a professional image. Presidential address, 1969 National Convention. ASHA, 1970, 12, 5-6.
Overall, J. Classical statistical hypothesis testing within the context of Bayesian theory. Psychol. Bull., 1969, 71, 285-293.
Rosenthal, R., Gaito, J. The interpretation of levels of significance by psychological researchers. J. Psychol., 1963, 55, 33-38.
Schultz, M., Roberts, W., Yairi, E. The clinician and the researcher: comments. ASHA, 1972, 14, 539-541.
Starkweather, C. The case against base rate comparisons in stuttering experimentation. J. Commun. Dis., 1971, 4, 247-258.
Tversky, A., Kahneman, D. Belief in the law of small numbers. Psychol. Bull., 1971, 76, 105-110.
