
Perspective

Strategies for Improving Power in Diagnostic Radiology Research

Craig A. Beam1

Research studies in diagnostic radiology often compare the diagnostic abilities of two imaging techniques. The "power" of such studies is the probability that they will detect a difference in abilities of a certain amount when, indeed, such a difference does exist. This article outlines several strategies that can be used to assess and improve the power of radiologic diagnostic studies. These strategies include selection of cases and controls, matching, use of one-tailed tests, selection of significance level, and choice of sample size.

A story told in statistical circles goes like this: There once was an investigator presenting his findings about the safety of a new compound to a scientific gathering. At one point the investigator reports to his audience, "Thirty-three percent of the rats died within 24 hr after administration of the agent, and 33% survived at least 24 hr after administration. Unfortunately, the third rat got away." The joke, of course, is the incredulity of findings based on so small a sample and the use of "statistification" to dress things up. The fact that almost everyone who hears this story appreciates its point shows that we all somehow understand that larger sample sizes give more reliable results and that, indeed, sometimes sample sizes can be too small to arrive at sound scientific conclusions.

The reliability of a study can be viewed in two ways. In one, we consider the ability the study has to find something if, indeed, it is there to be found. This is the power of the study, and the purpose of this paper is to familiarize the radiologist with several strategies that can be used to assess and improve the power of studies that compare imaging techniques. This familiarity will promote and facilitate interaction with statisticians at the design stage of a research study. The other aspect of a study's reliability is the precision with which estimates of diagnostic ability are made. Although space limitations do not allow review of this other aspect of a study's reliability, strategies that increase the power of a study often also increase the precision of estimates. Table 1 presents some of the concepts used in the analysis of the power of diagnostic studies.

Received June 24, 1991; accepted after revision February 25, 1992.
1 Department of Radiology and Division of Biometry, Department of Community and Family Medicine, Box 3808, Duke University Medical Center, Durham, NC 27710.
AJR 159:631-637, September 1992 0361-803X/92/1593-0631 © American Roentgen Ray Society

Strategies for Increasing Power

Quite often in diagnostic imaging research, studies are conducted to determine which of two techniques has better sensitivity or specificity. Power here, then, is the probability that the experiment will find a difference in the two sensitivities (or specificities) when, in fact, there is a difference of a certain amount to be found. The power of a diagnostic study depends to a large extent on its design and analysis. Accordingly, several strategies can be used to increase power via study design and analysis.

Selection of Cases and Control Subjects

One way to optimize the power to detect a difference in sensitivities is to select cases that will be neither too easy nor too difficult to diagnose. This method is based on the notion that larger differences in diagnostic performance ought to be easier to detect than smaller differences and that tests can be made to look similar in their performance if the cases selected are all too easy or too difficult to diagnose. For example, inclusion of patients who have only large lesions will make sensitivities of both tests nearly 100% and, hence, very close. Similar concerns apply to the selection of control subjects for the comparison of specificities.

TABLE 1: Concepts Used in Power Analysis

Null hypothesis: The statement that the two imaging techniques are equivalent in their ability to diagnose or that one technique (the "contender") is no better than the other (the "reference" technique).

p value: The probability of observing data as extreme or more extreme than those observed in the study, assuming the null hypothesis is true.

Significance level: The cutoff for deciding which p values lead to rejection of the null hypothesis.

Power: The probability that the study will detect a difference in diagnostic abilities of a certain amount given such a difference actually exists.

Study Design

Another way to increase power is by choosing the most judicious study design and selecting the most sensitive method of analysis. Study design is often more important than the method of analysis because the latter depends on the characteristics of the data, which are often determined by the design of the study. For example, diagnostic studies in which both techniques are evaluated in the same subjects ("paired" studies) are always at least as powerful, and usually more powerful, than studies in which separate groups of subjects are used for each technique [1]. This statement is true, however, only insofar as the method of analyzing the data makes good use of the additional information that is acquired through pairing. Hence, it is essential to choose the most sensitive statistical method appropriate to the study design.

Specification of the Null Hypothesis

The null hypothesis is the position of the doubting Thomas. But, not only is the scientific Thomas a doubter, he also comes from Missouri, and the goal of the scientific experiment is often to acquire the evidence that shows the doubting Thomas he is wrong. In other words, the null hypothesis is the assumption that no differences exist until proved otherwise. In regard to studies comparing two diagnostic techniques, the null hypothesis is either the assertion of equality in abilities between the techniques or the assertion that one of the techniques (the "contender") is simply no better than another technique (the "reference" technique).

The statistical method of hypothesis testing rejects the null hypothesis whenever the differences in the results of a study are too unlikely relative to the outcome expected by the null hypothesis. The "unlikeliness" of the data in relation to the null hypothesis is measured by the p value, which is the probability of observing data as extreme or more extreme than those observed in the study. This probability is computed under the assumption that the null hypothesis is true. Largely by convention, data that have p values less than .05 are considered too extreme and lead to rejection of the null hypothesis. But, in fact, the cutoff used to define which p values are too small (the "significance level" of the hypothesis test) is arbitrary and can be selected by the researcher to be a value different from the conventional 5%. The next section discusses this strategy in greater detail.

When the null hypothesis claims that the two diagnostic techniques are equivalent, more extreme data come from performance by the contender that is either better or worse than the performance of the reference. As such, evidence against the null hypothesis is to be found in either extreme or "tail" of the probability distribution of the statistic used to compare the techniques. This type of hypothesis test has come to be known as a "two-tailed" test. On the other hand, if the null hypothesis states that the contender is simply no better than the reference, evidence contrary to this hypothesis can come only when the contender outperforms the reference. Hence, here the evidence against the null hypothesis comes from only one extreme or "tail" of the probability distribution of the statistic used to compare the techniques. This type of hypothesis test is known as a one-tailed test.

One-tailed hypothesis tests are usually more powerful than two-tailed tests, and thus, another tool that we can use to influence power is our specification of the null hypothesis. When diagnostic techniques are compared, a one-tailed hypothesis test seeks to show superiority of one particular technique over another. Thus, one-tailed tests are used when we wish to show that a new technique (the "contender") is better than a "reference" technique. In this case, the null hypothesis states that the contender is no better (could even be worse!) than the reference. If we reject this null hypothesis, we then conclude that the contender is, indeed, superior. On the other hand, if we simply seek to decide whether the two techniques differ diagnostically, then we would use a "two-tailed" hypothesis test with the null hypothesis stating that they are equal.

The investigator must be sure not to use this strategy indiscriminately, for there are situations in which the one-tailed hypothesis test is not appropriate, as in those cases in which our interest is in finding the best diagnostic technique, whichever that might be. The one-tailed test fails to be adequate here because it can provide only evidence of the superiority of one particular technique over the other and cannot demonstrate superiority the other way around. Sometimes "weeding out" inferior techniques is as important as finding better ones, and in these cases the two-tailed test is required.
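To make the one-tailed versus two-tailed distinction concrete, the following Python sketch (an editorial illustration, not part of the original article) compares the two p values for the same hypothetical data, using a simple normal (z) approximation for the difference between two sensitivities. The function name, the counts, and the choice of a pooled z statistic are assumptions made only for illustration.

```python
# Illustrative sketch (not from the original article): one- vs. two-tailed p values
# for comparing two sensitivities with a normal (z) approximation.
from math import sqrt, erf

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def z_test_two_proportions(correct1, n1, correct2, n2):
    """z statistic for H0: sensitivity(contender) is no better than the reference."""
    p1, p2 = correct1 / n1, correct2 / n2          # reference, contender
    pooled = (correct1 + correct2) / (n1 + n2)     # pooled proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# Hypothetical data: reference detects 40/50 lesions, contender detects 47/50.
z = z_test_two_proportions(40, 50, 47, 50)
p_one_tailed = 1.0 - normal_cdf(z)                 # evidence only when the contender is better
p_two_tailed = 2.0 * (1.0 - normal_cdf(abs(z)))    # evidence in either direction

print(f"z = {z:.2f}, one-tailed p = {p_one_tailed:.3f}, two-tailed p = {p_two_tailed:.3f}")
# For the same observed difference the one-tailed p value is half the two-tailed
# value, so a fixed significance level is crossed more easily -- the source of
# the extra power described above.
```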

Selection of Significance Level

A fourth way to influence power is by the choice of the "significance level" of our hypothesis test. The significance level defines which p values will be considered "too small" and acts as a cutoff to decide whether or not to reject the null hypothesis. The p value from the experiment is compared with this cutoff and, if it is smaller, we then declare the results too extreme, decide against the null hypothesis, and conclude that the diagnostic techniques differ.

Power and the significance level share an important relationship: as the significance level of a test increases, the power of the test increases. For example, consider increasing a significance level from 5% to 10%. In the first case only data that have less than a 5% (i.e., a one in 20) chance of occurring are considered too "unlikely" and lead to the rejection of the null hypothesis. In the second case, the data can be even more likely to occur (as likely as having a one in 10 chance of occurring) and will still be considered too "unlikely." Thus, we have an easier time rejecting the null hypothesis in the 10% case than in the 5% case and so have greater power.

Yet, although one might be tempted to use a larger significance level in order to increase the chance of finding something, the other side of the issue is that the significance level is also the probability of rejecting a true null hypothesis; that is, of making the error of declaring there is a difference when in fact none exists. Hence, although we do gain power by increasing the significance level from 5% to 10%, we also increase the probability of making a false-positive type of finding in our study from 5% to 10%.

The 5% significance level (p < .05) currently exists in the literature as the standard, and so the use of a different level might arouse suspicion. Nonetheless, the technique of changing the significance level should not be overlooked as a means of increasing power when designing a study in those cases in which, for example, every bit of power that can be mustered is crucial and the penalty arising from a false-positive conclusion is minimal. Of course, this alteration must be done at the study design stage, because it is not ethical to adjust the significance level after the fact in order to yield statistically "significant" results.
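The size of this trade-off can be seen numerically. The sketch below (an editorial addition, not from the original article) plugs the significance-level factors listed later in Table 2 (1.645 for 5%, 1.280 for 10%) into a plain normal-approximation power formula for the one-tailed comparison of two sensitivities; the formula omits the continuity correction used in Table 2 and is meant only to show the direction and rough size of the effect.

```python
# Sketch (not from the original article): how raising the significance level
# raises power, using a simple normal approximation for the one-tailed
# comparison of two sensitivities (no continuity correction).
from math import sqrt, erf

def normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def approx_power(p1, p2, n_per_group, slf):
    """Approximate power with n_per_group patients per technique.
    slf is the significance-level factor (z value), e.g. 1.645 for 5%."""
    p_bar = (p1 + p2) / 2.0
    se0 = sqrt(2.0 * p_bar * (1.0 - p_bar))      # variability term under the null
    se1 = sqrt(p1 * (1 - p1) + p2 * (1 - p2))    # variability term under the alternative
    z = ((p2 - p1) * sqrt(n_per_group) - slf * se0) / se1
    return normal_cdf(z)

# Reference 80% sensitive, contender 95% sensitive, 50 patients per group.
for level, slf in [("5%", 1.645), ("10%", 1.280)]:
    print(level, round(approx_power(0.80, 0.95, 50, slf), 2))
# Moving from a 5% to a 10% significance level raises power (here roughly from
# the mid-70% range to the mid-80% range), at the cost of doubling the
# false-positive probability.
```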

Finally, as is well known, one can influence the power of a study via the choice of sample size. However, use of this strategy to increase power requires a bit more in-depth consideration on our part. The next section discusses the selection of sample size to achieve desired levels of power in radiologic diagnostic studies.

Power and Sample Size When Comparing Sensitivities and Specificities

An important initial consideration in the design of studies that compare the sensitivity or specificity of radiologic techniques is whether or not both techniques will be applied to the same set of subjects. A study that uses separate groups of subjects for each imaging technique would, in standard terminology, be called an "unmatched-groups" or "independent-groups" study. A study in which both techniques are applied to the same group of subjects would be called a "matched-groups" or, more specifically, a "paired" study. Aside from ethical considerations, each type of study has its own statistical considerations that influence the choice of sample size for obtaining a certain probability, or power, to detect differences between techniques. As will be seen later, these different considerations arise from different statistical approaches to the comparison of the techniques.

A standard method for comparing sensitivities or specificities of techniques in unmatched-groups diagnostic studies is to assign separate samples of patients to be imaged by each of the two techniques and then summarize diagnostic performance in a table such as given in Figure 1. Here we have n1 patients being imaged by the reference technique and n2 by the contender. Proportions of correct diagnoses in the two samples of patients can then be compared statistically by using the Yates continuity-corrected chi-square test [2].

Important assumptions underlie this statistical test. If they are not met, the results obtained will be questionable. The first assumption is that each sample of subjects represents a random sample from the same population. If this is not the case, besides invalidating the statistical properties of the test, any observed differences could be due instead to differences between the populations and not due to real differences in diagnostic performance of the techniques. A second assumption is that the probability of correct diagnosis is the same from patient to patient within each sample. This assumption is largely met by randomly sampling each population. However, it does also require constancy and uniformity in imaging and image interpretation. For example, this statistical procedure will be invalid if learning accompanies the interpretation. A final assumption is that the images are interpreted independently; that is, how one image is interpreted in no way influences the interpretation of another. Absence of either of these two latter assumptions invalidates the estimate of variability used to form the statistic and in the calculation of its p value.
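As a concrete, purely illustrative sketch of this analysis, the Python code below builds the Figure 1 table from hypothetical counts and computes the Yates continuity-corrected chi-square statistic by hand. The counts, the function names, and the normal-distribution shortcut used for the 1-df chi-square p value are editorial assumptions, not part of the original article.

```python
# Sketch (hypothetical data, not from the article): Yates continuity-corrected
# chi-square comparison of two sensitivities in an unmatched-groups design.
from math import sqrt, erf

def chi2_1df_sf(x):
    """Upper tail of a chi-square with 1 df, via the standard normal distribution."""
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(sqrt(x) / sqrt(2.0))))

def yates_chi_square(a1, b1, a2, b2):
    """2 x 2 table as in Fig. 1: rows = reference/contender,
    columns = correct (a) / incorrect (b) diagnoses."""
    n1, n2 = a1 + b1, a2 + b2
    n = n1 + n2
    col_a, col_b = a1 + a2, b1 + b2
    # Yates correction: subtract n/2 from |ad - bc| before squaring.
    num = n * (abs(a1 * b2 - b1 * a2) - n / 2.0) ** 2
    den = n1 * n2 * col_a * col_b
    return num / den

# Hypothetical study: reference correct in 58/72 patients, contender in 68/72.
stat = yates_chi_square(58, 14, 68, 4)
print(f"chi-square = {stat:.2f}, p = {chi2_1df_sf(stat):.3f}")
```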

Fig. 1.-Typical summary of diagnostic performance in an unmatched-groups study design: n1 patients have been imaged by the reference technique and n2 have been imaged by the contender. Letters in squares indicate numbers of patients. If all patients have the abnormality, then the sensitivity of the reference is estimated by a1/n1 and that of the contender by a2/n2. If all patients are control subjects, then the specificity of the reference is estimated by b1/n1 and that of the contender by b2/n2.

Casagrande and Pike [3] provide a formula to estimate the common sample size for each imaging technique required to obtain a desired level of power for the comparison of sensitivities (or specificities) in an unmatched-groups diagnostic study. This formula is presented in Table 2 and pertains to the one-tailed situation, in which we are interested in determining only if one test (the contender) is better (e.g., more sensitive or specific) than another test (the reference). Also, although sensitivity is used in the examples, everything developed applies equally to the analysis of specificity as well. Note that none of the formulas introduced here can be used to compare diagnostic accuracy or predictive values, as these quantities require the additional estimation of prevalence rates, which is not typically done in these studies.

TABLE 2: Sample Size Estimation When Comparing Sensitivities (Specificities) in the Unmatched-Groups Diagnostic Study

Data: As in Figure 1
Test: Chi-square with Yates correction (see Snedecor and Cochran [2])
Sample size: Using the method given by Casagrande and Pike [3], the sample size for each group is estimated to be

    n = A x [1 + (1 + 4 x (P2 - P1)/A)^1/2]^2 / [4 x (P2 - P1)^2]

where
    A = [SLF x (2 x Pbar x Qbar)^1/2 + PF x (P1 x Q1 + P2 x Q2)^1/2]^2,
    P1 = sensitivity (specificity) of the reference technique,
    P2 = sensitivity (specificity) of the contender,
    Pbar = (P1 + P2)/2, Qbar = 1 - Pbar, Q1 = 1 - P1, and Q2 = 1 - P2.

Factors: The following factors associated with the significance level and power are used in the formulas presented in this paper for estimating sample size:

    Significance level (%):            1      5      10
    Significance level factor (SLF):   2.325  1.645  1.280

    Power (%):                         99     95     90     80     70
    Power factor (PF):                 2.325  1.645  1.280  0.840  0.525

As can be seen from Table 2, the formula from Casagrande and Pike requires specification of (a) the significance level of the hypothesis test, (b) the desired power, and (c) the sensitivities of the two techniques. The last item required for specification shows that sample size determination depends on subjective appraisal and prior knowledge, because to use this formula we must be able to suggest values for the sensitivities of the two techniques that are not only plausible but also represent the smallest clinically important difference. Obviously, to consider implausible values is a waste of time and to consider differences too small to be of practical interest is a waste of resources.

Table 3 provides an example of the use of the Casagrande and Pike formula to estimate sample size. Here we suppose we wish to compare two techniques in which, from our experience, we expect the reference technique to have approximately 80% sensitivity. Also suppose that we are interested in the contender only if its sensitivity exceeds 95% (only then will the cost of this new procedure be worthwhile to recommend it in place of the reference). Also, suppose we wish to have at least 80% power in picking up this type of a difference between the two techniques. From the formula, we find we would need about 72 patients per technique (i.e., 144 patients total) to have an 80% chance of finding such an improvement over the reference.

TABLE 3: Example of Sample Size Estimation in the Unmatched-Groups Diagnostic Study

Suppose we want to design our study so that there is an 80% chance of detecting a difference in the two imaging techniques when the sensitivity of the reference technique is 80% and the sensitivity of the contender is 95%.

From Table 2, the power factor (PF) for 80% power is found to be 0.840. Also, for a 5% significance level, the significance level factor (SLF) is 1.645. The following quantities are required for the formula in Table 2: Pbar = 0.875, Qbar = 0.125, Q1 = 0.20, and Q2 = 0.05.

    A = [1.645 x (2 x 0.875 x 0.125)^1/2 + 0.840 x (0.8 x 0.2 + 0.95 x 0.05)^1/2]^2 = 1.327

Then, the estimated sample size is

    n = 1.327 x [1 + (1 + 4 x (0.15)/1.327)^1/2]^2 / [4 x (0.15)^2] = 71.7

As we cannot have a fraction of a patient, we should round this estimate up to ensure our study has 80% power.
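For readers who prefer to check the arithmetic in software, the short Python sketch below (an editorial addition; the function name and default arguments are mine) implements the Table 2 formula and reproduces the estimate worked out in Table 3.

```python
# Sketch of the Casagrande-Pike sample-size formula of Table 2 (variable names
# are mine; the factors 1.645 and 0.840 are the SLF and PF listed in Table 2).
from math import sqrt, ceil

def unmatched_n_per_group(p1, p2, slf=1.645, pf=0.840):
    """Patients per technique for an unmatched-groups comparison of
    sensitivities (or specificities); p1 = reference, p2 = contender."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    p_bar = (p1 + p2) / 2.0
    q_bar = 1.0 - p_bar
    a = (slf * sqrt(2.0 * p_bar * q_bar) + pf * sqrt(p1 * q1 + p2 * q2)) ** 2
    delta = p2 - p1
    return a * (1.0 + sqrt(1.0 + 4.0 * delta / a)) ** 2 / (4.0 * delta ** 2)

n = unmatched_n_per_group(0.80, 0.95)
print(f"{n:.1f} per group -> enroll {ceil(n)} per group, {2 * ceil(n)} in total")
# Prints 71.7 per group, i.e. 72 per technique and 144 patients in total,
# matching Table 3.
```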

Certainly this is not an encouraging result from the standpoint of practical study design. For, how often is it reasonable to expect total sample sizes to exceed 100 in those ubiquitous situations in which ethical and economic considerations severely restrict enrollment of patients? Fortunately, the matched-groups design offers an alternative that usually provides greater power per sample size than afforded by the unmatched-groups study.

In the matched-groups imaging study each patient is imaged by both techniques. A radiologist then gives two diagnoses for each patient based on separate interpretations of the two images. The resulting data from n patients can then be presented as in the cross-tabulation illustrated in Figure 2. An important piece of information obtainable from this table, and the reason this table should be presented whenever reporting diagnostic results from matched-groups studies, is the pattern of agreement and disagreement between the two techniques. For example, from this table we see that in (a + d) of the n patients the contender agreed with the reference. Supposing that all n patients had the abnormality (i.e., positives or +), then we would also observe that in the cases of disagreement, b patients were correctly classified by the contender and incorrectly by the reference. Similarly, c patients were correctly classified by the reference and not by the contender.

Fig. 2.-Typical summary of diagnostic performance in a matched-groups study design: n patients have been imaged by both techniques. Letters in squares indicate numbers of patients. If all patients have the abnormality, then the sensitivity of the reference is estimated by (a + c)/n and that of the contender by (a + b)/n. Difference in sensitivities would be estimated by (b - c)/n. If all patients are control subjects, then the specificity of the reference is estimated by (b + d)/n and that of the contender by (c + d)/n. Difference in specificities would be estimated by (b - c)/n.

If all n patients indeed have the abnormality, then the sensitivity of the reference is estimated by (a + c)/n, the sensitivity of the contender is estimated by (a + b)/n, and the difference in sensitivities is then estimated by (b - c)/n. Hence, when we compare the sensitivities of these two techniques, the only useful information we have in deciding between the two resides in the instances of their disagreement. Furthermore, if the techniques really are equal, then we should expect disagreement to occur solely by chance, with there being an equal chance of disagreement possible in either direction (either b or c in Figure 2). Assuming this null hypothesis to be the case, we would expect to observe about equal values for b and c in the cross-classification table when the techniques are equal. This is the idea behind the McNemar test.

The McNemar test is used for the analysis of the matched-groups imaging study and assumes that the n patients are a random sample from some population. As before, this study design also assumes the sensitivity or specificity of each technique is constant. Violations of these assumptions can compromise the reliability of this statistical procedure. Further discussion about the use of the McNemar test in radiologic research can be found in Dwyer [1].
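The sketch below (an editorial illustration with hypothetical counts) shows how the Figure 2 cross-tabulation yields the paired sensitivity estimates and a McNemar statistic built from the discordant cells b and c. The continuity correction and the normal-distribution shortcut for the 1-df chi-square p value are standard choices of mine, not taken from the article.

```python
# Sketch (hypothetical counts, not from the article): matched-groups analysis
# of the Fig. 2 cross-tabulation with the McNemar test.
from math import sqrt, erf

def chi2_1df_sf(x):
    """Upper tail of a chi-square with 1 df, via the standard normal distribution."""
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(sqrt(x) / sqrt(2.0))))

def mcnemar(a, b, c, d):
    """a, b, c, d as in Fig. 2 (all n patients have the abnormality):
    a = both correct, b = contender only, c = reference only, d = neither."""
    n = a + b + c + d
    sens_reference = (a + c) / n
    sens_contender = (a + b) / n
    # Only the discordant pairs b and c carry information about the difference.
    stat = (abs(b - c) - 1.0) ** 2 / (b + c)   # McNemar with continuity correction
    return sens_reference, sens_contender, stat

sr, sc, stat = mcnemar(a=30, b=7, c=1, d=2)
print(f"reference {sr:.2f}, contender {sc:.2f}, "
      f"McNemar chi-square = {stat:.2f}, p = {chi2_1df_sf(stat):.3f}")
```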

Conner [4] provides a formula that can be used to estimate sample size in the matched-groups study. Although this formula can be a bit too optimistic by underestimating the sample size, those rare cases in which high power might be obtained from small sample sizes are unlikely to occur in radiologic research and can be ignored. Table 4 summarizes this formula. In addition to specifying plausible values for the two tests' sensitivities, as in the unmatched-groups case, the matched-groups design also requires specification of the probability of disagreement between the tests. In fact, our choice of sensitivities (or specificities) for the two tests determines the range of values possible for this probability. Table 4 provides a formula for the limits on the probability of disagreement. Table 5 provides an example.

TABLE 4: Sample Size Estimation When Comparing Sensitivities (Specificities) in the Matched-Groups Diagnostic Study

Data: As in Figure 2
Test: McNemar (see Dwyer [1])
Sample size: Using the method given by Conner [4], the sample size for the entire study is estimated to be

    n = [SLF x Psi^1/2 + PF x (Psi - (P2 - P1)^2)^1/2]^2 / (P2 - P1)^2

where Psi = probability of disagreement between the techniques, P1 = sensitivity (specificity) of the reference technique, P2 = sensitivity (specificity) of the contender, SLF = significance level factor (see Table 2), and PF = power factor (see Table 2).

Bounds on the probability of disagreement (Psi): Given values for P1 and P2, the minimum probability of disagreement equals P2 - P1. For practical purposes, the maximum probability of disagreement is when agreement occurs solely by chance. In this case, the probability of disagreement equals P1 x (1 - P2) + (1 - P1) x P2.

For example, suppose as before that our reference test has 80% sensitivity and that we wish to detect an improvement in the contender at least as great as 95%. On the basis of these values, we know from Table 4 that the probability of disagreement ranges from a low of 15% to a high of 23%. If we assume the lowest probability of disagreement between the two techniques, then from the formula we would estimate that about 40 patients would be required to achieve 80% power. On the other hand, if we assume the highest probability of disagreement (which, for practical purposes, occurs when the techniques agree solely by chance), we would estimate that about 62 patients would be needed.

TABLE 5: Example of Sample Size Estimation in the Matched-Groups Diagnostic Study

As in Table 3, suppose we want to design our study so that there is an 80% chance of detecting a difference in the two imaging techniques when the sensitivity of the reference technique is 80% and the sensitivity of the contender is 95%. As in the earlier example, PF = 0.840 and SLF = 1.645.

From Table 4, we see that the lowest possible probability of disagreement between the two techniques is 0.95 - 0.80 = 0.15. The greatest probability of disagreement is 0.80 x (0.05) + (0.20) x 0.95 = 0.23.

Assuming the lowest possible probability of disagreement, our estimated sample size is

    n = [1.645 x 0.15^1/2 + 0.840 x (0.15 - 0.15^2)^1/2]^2 / 0.15^2 = 39.0

Rounding

If we assume the two techniques agree only at the level expected by chance, then, using 0.23 as the probability of disagreement, we would estimate that 62 patients would be required to ensure an 80% chance of detecting this particular difference between the two techniques.

636

BEAM

AJR:159,

Fig. 3.-Example

Downloaded from www.ajronline.org by NYU Langone Med Ctr-Sch of Med on 07/20/15 from IP address 128.122.253.212. Copyright ARRS. For personal use only; all rights reserved

**

This of a with

ESTIMATED

TOTAL

NUMBER

OF PATIENTS

that

*5

65.00 106. 70.00 116. 75.00 130. 80.00 144. 85.00 164. 90.00 190. 95.00 232. (Use PRINTSCREEN If you want to run SSADSIR. Press to continue.

LOW (

48./group) C 53./group) ( 58./group) C 65./group) ( 72./group) C 82./group) ( 95./group) ( 116./group) button to print

MED

24. 27. 31. 35. 40. 46 54. 67. this analysis}

of disagreement

between

the two tests.

One can

readily see from this example the efficiency (and economy) that can be obtained through the use of matched groups in the study design. However, also note that although the savings to be had is in the number of patients imaged, it is not necessarily in the number of images to be interpreted. In the unmatched-groups study, 144 total images had to be interpreted. But, although fewer patients needed to be enrolled in the matched-groups study, if the probability of disagreement is “high,” then 124 images (two from each patient) will have to be interpreted. Thus, ignoring the issue of patients, the major benefit to be had from the matched-groups design comes when the chance of disagreement between the tests is likely to be low. In fact, as seen in our example,

size

from a program (SSADSIR).

In the

with a conThe output shows sample size estimates for both unmatchedand matched-groups studies that are required to obtain powers ranging from 60% to 95%. The program also reports the lowest and highest probabilities of disagreement possible

and reports sample sizes for these extremes as well as for a probability of disagreement halfway between the extremes.

HIGH

30. 34. 39. 44. 51. 58. 69. 86.

37. 42. 47. 54. 62. 71. 84. 106.

.

Table 4 that the probability of disagreement ranges from a low of 15% to a high of 23%. If we assume the lowest probability of disagreement between the two techniques, then from the formula we would estimate that about 40 patients would be required to achieve 80% power. On the other hand, if we assume the highest probability of disagreement (which, for practical purposes, occurs when the techniques agree solely by chance), we would estimate that about 62 patients would be needed. Compare these sample size requirements with those from the unmatched-groups design. There we estimated 144 patients would be required to achieve 80% power. Thus, in going to a matched-groups design we save at least 82 patients in the situation of high probability of disagreement and up to 1 02 patients in the very plausible situation of low probability

of output

sensitivity and is being compared tender that has 95% sensitivity.

when using a 5% Significance Level. Based upon the values you selected: The LOWEST possible probability of disagreement is .150 The highest probability of disagreement considered by this analysis occurs when the techniques agree solely by chance. For your values this HIGHEST probability of disagreement is .230 SSADSIR has used .190 as a MEDIUM amount of disagreement MATCHED STUDY UNMATCHED Probability of Disagreement

STUDY 96.

sample

example given, the reference technique has 80%

power analysis is for the comparison technique having 95.0% sensitivity/specificity a REFERENCE technique having 80.0% sensitivity/specificity

POWER 60.00

estimates

September1992

if the chance

of disagreement

is as

low as 15%, then we require 80 images (since n = 40) to be interpreted-still a big savings over the requirements of the unmatched-groups design.

SSADSIR

Figure 3 shows output from an interactive FORTRAN program that does the sample size estimation described in this article. This program (called SSADSIR for Sample Size Analysis of Diagnostic Studies in Radiology) asks the user to enter values for the sensitivity or specificity of the reference technique and that of the contender. The program then prints a table of sample-size estimates for both unmatchedand matched-groups studies to obtain powers ranging from 60% to 95%. The program also reports the lowest and highest probabilities of disagreement possible for the set of sensitivities or specificities specified by the user, and reports sample sizes for matched-groups studies using these extreme values as well as sample size using a value of disagreement halfway between the extremes? The example shown in Figure 3 continues the earlier examples with the reference technique having 80% sensitivity and the contender having 95% sensitivity. The sample sizes required for 80% power are the same as in the earlier exampIes. We also see in this output sample sizes for other powers. Thus, we observe that increasing our desired level of power increases our sample-size requirement. For example, going from 80% power to 95% power will cost us an additional 88 patients in the unmatched study. The same increase in power, however, costs only an additional 27 patients in the matchedgroups study when the disagreement between the techniques is at the lowest possible level.

* A free copy of the SSADSIR program will be provided upon receipt of a formatted diskette and mailer. Please send requests to: SSADSIR, Section of Imaging Research Statistics, Box 3808, Department of Radiology. Duke Uni-

versity Medical Center, Durham, NC 27710.

AJR:159,

IMPROVING

September1992

POWER

IN

DIAGNOSTIC

I thank

I reviewed

several

when designing cessful

637

RESEARCH

ACKNOWLEDGMENTS

Summary

Downloaded from www.ajronline.org by NYU Langone Med Ctr-Sch of Med on 07/20/15 from IP address 128.122.253.212. Copyright ARRS. For personal use only; all rights reserved

RADIOLOGY

outcome.

strategies

a study Methods

for

to optimize for

an investigator

the chances

estimating

required

to use

of a sucsample

size were reviewed for study situations and designs commonly reported in the radiologic literature. Other methods are available (see, for example, Snedecor and Cochran [2] and Cohen [5]). The accuracy of estimates of sample size depends on how closely the required assumptions are met. Investigators should be aware of these assumptions. Failure to meet these assumptions does not eliminate the possibility of doing the investigation. Alternative procedures with their own methods to estimate sample size often exist for the statistical procedunes I have described. In these cases, the investigator should consult with a statistician.

gesting

Mark this

E. Baker,

paper

and

Duke for

University

helpful

Medical

discussions.

Sullivan, Susan Paine, and the other reviewers their suggestions.

Center, I also

for thank

of this manuscript

sugDan

for

REFERENCES 1 . Dwyer

AJ. Matchmaking

and McNemar

modalities. Radiology 1991;178:328-330 2. Snedecor GW, Cochran WG. Statistical State University Press, 1980: 124-125, 3. Casagrande JT, Pike MC. An improved sample sizes for comparing two

in the comparison methods,

of diagnostic

7th ed. Ames: The Iowa

129-1 30 approximate formula binomial distributions.

for calculating Biometrics

1978;34:483-486 4. Conner RJ. Sample size for testing differences in proportions for the paired-sample design. Biometrics 1987;43: 207-211 5. Cohen J. Statistical power analysis for behavioral sciences. Hillsdale, NJ: Eribaum Associates, 1988

Strategies for improving power in diagnostic radiology research.

Research studies in diagnostic radiology often compare the diagnostic abilities of two imaging techniques. The "power" of such studies is the probabil...