Receiver operating characteristic curves: a basic understanding.


David J. Vining, MD, and Gregory W. Gladish, BS

Receiver operating characteristic (ROC) analysis is one form of objective measurement that can be used to compare newer imaging technologies against human observer performance (the ability of the expert radiologist). Use of ROC curves allows one to account for a continuum of radiologic interpretations when calculating sensitivity and specificity for an imaging modality, and it avoids the inaccuracies that arise from assuming that imaging findings are absolutely normal or abnormal. An ROC curve is generated by plotting sensitivity on the y axis as a function of [1 − specificity] on the x axis for a continuum of diagnostic criteria. ROC curves allow visual analysis of the trade-offs between the sensitivity and the specificity of a test with regard to the variable diagnostic criteria used by radiologists. Because ROC curve analysis is gaining wide acceptance in the medical literature, an explanation of ROC methods with simple examples is needed to increase the knowledge and understanding of practicing radiologists.


INTRODUCTION

As imaging technologies rapidly emerge, radiologists need an objective means to compare the efficacy of these newer techniques with more established methods. Receiver operating characteristic (ROC) analysis is one form of objective measurement in which the newer modalities are compared with human observer performance (the ability of the expert radiologist). Since ROC curves were first applied to radiology in 1960, their application has continually expanded (1-4).

For radiologists, ROC curves provide information for an objective comparison of imaging techniques to determine, for example, whether computed tomography (CT) or magnetic resonance (MR) imaging is better for the detection of liver metastases. For hospital administrators, ROC curves allow, for example, a comparison of the performance of imaging systems before major investments are made. For radiology residents, questions regarding ROC curves are included in the written examinations for board certification (5). Examples of ROC curve use are readily found in the current literature and at scientific assembly presentations. When radiologists are confronted with two or more ROC curves representing different diagnostic tests or imaging modalities (Fig 1), they must determine which modality is better. This article presents the anatomy, construction, use, and interpretation of ROC curves, with the aim of helping radiologists understand their meaning and their limitations.

Abbreviations: FNF = false-negative fraction, FPF = false-positive fraction, ROC = receiver operating characteristic, TNF = true-negative fraction, TPF = true-positive fraction

Index terms: Images, interpretation • Receiver operating characteristic curve (ROC)

RadioGraphics 1992; 12:1147-1154 (Volume 12, Number 6, November 1992)

From the Department of Radiology, The Johns Hopkins Hospital, Baltimore, and Louisiana State University Medical Center, Shreveport, La. Recipient of a Certificate of Merit for a scientific exhibit at the 1991 RSNA scientific assembly. Received March 17, 1992; revision requested April 24; revision received June 9; accepted June 30. Supported in part by a Louisiana Board of Regents Education Quality Support Fund grant (1990-92)-RD-A-16. Address reprint requests to D.J.V., Department of Radiology, The Johns Hopkins Hospital, 600 N Wolfe St, Baltimore, MD 21205.

© RSNA, 1992

USE OF ROC CURVES

In everyday medical practice, mistakes occur when clinicians attempt to determine the presence or absence of a disease on the basis of the findings from a single imaging study. Similar inaccuracies occur in radiology practice when sensitivity and specificity are used to describe the findings from a radiologic examination with the assumption that there are only two possible interpretations: normal or abnormal. Sensitivity refers to the percentage of patients with disease who have abnormal findings at examination, whereas specificity describes the percentage of patients without disease who have normal findings. Quoting single sensitivity and specificity values for a diagnostic test (or imaging modality) to judge its overall usefulness can be misleading, because a single pair of values does not account for the wide variability of interpretations used by radiologists.

Figure 1. Typical ROC curves (true-positive fraction plotted against false-positive fraction). Which is the better diagnostic modality?

Figure 2. Thallium stress test (short axis view of the left ventricle), with a scale running from 0% wall defect through 50% wall defect to 100% wall defect and labeled normal, borderline, and definitely abnormal. A single criterion is used to judge the thallium stress test: a ventricular wall defect greater than 50% is considered abnormal, whereas one less than 50% is normal.
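To make the point about single values concrete, here is a minimal Python sketch (not from the original article) that computes sensitivity and specificity from binary normal/abnormal calls. The wall-defect percentages are hypothetical, the function name is illustrative, and reading a defect strictly greater than the cutoff as abnormal is an assumption. The same patients yield a different sensitivity-specificity pair when only the cutoff changes.

```python
def sensitivity_specificity(diseased_defects, healthy_defects, cutoff):
    """Sensitivity and specificity when a wall defect > cutoff (%) is read as abnormal.

    Hypothetical helper for illustration; the strict ">" is an assumption.
    """
    # True positives: patients with disease whose findings are called abnormal.
    tp = sum(1 for d in diseased_defects if d > cutoff)
    # True negatives: patients without disease whose findings are called normal.
    tn = sum(1 for h in healthy_defects if h <= cutoff)
    return tp / len(diseased_defects), tn / len(healthy_defects)


# Hypothetical wall-defect percentages, invented for illustration only.
diseased = [80, 65, 55, 40, 20]
healthy = [0, 5, 35, 60, 0]

for cutoff in (50, 30):
    sens, spec = sensitivity_specificity(diseased, healthy, cutoff)
    print(f"cutoff > {cutoff}%: sensitivity {sens:.2f}, specificity {spec:.2f}")
# cutoff > 50%: sensitivity 0.60, specificity 0.80
# cutoff > 30%: sensitivity 0.80, specificity 0.60
```

Lowering the cutoff (reading more findings as abnormal) raises sensitivity at the cost of specificity, so neither pair alone describes the test.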

For example, the thallium stress test is used to distinguish between healthy people and those with heart disease. To illustrate the problem of applying a single criterion, thallium stress test results will be considered abnormal if a ventricular wall defect is greater than 50% and normal if the defect is less than 50% (Fig 2). If the results of testing each member of the diseased and healthy groups are compared with the actual occurrence of heart disease, the thallium stress test might have a sensitivity of 84% and a specificity of 94%. These values signify that 84% of people with heart disease will have positive findings and 94% of people without heart disease will have negative findings (Fig 3). As these results show, the thallium stress test is not a perfect test for determining whether a patient does or does not have disease. For the particular diagnostic criterion used by the radiologist, the healthy population is divided into a true-negative fraction (TNF) and a false-positive fraction (FPF), whereas the population with heart disease is divided into a true-positive fraction (TPF) and a false-negative fraction (FNF) (Fig 4). If a radiologist "underreads" findings and considers them abnormal only if they are definitely abnormal, the examination will have low sensitivity but high specificity. Those who "overread" findings use less strict criteria and consider more findings abnormal, thus producing high sensitivity but low specificity. In reality, radiologists report findings with a continuum of responses ranging from definitely abnormal to equivocal to definitely normal, because of the subjectivity and the bias of the individual radiologist. This is often referred to as the radiologist's "hedge." As the diagnostic criteria (biases) of the radiologist vary, the sensitivity and specificity of the test or imaging modality also vary. The sensitivity and specificity of an imaging modality in relation to this continuum of interpretations are difficult to describe with a single pair of numbers. ROC curves help by allowing visual analysis of the trade-offs between the sensitivities and specificities of a variety of diagnostic criteria used by radiologists.

Figure 3. With one criterion to determine the outcome of a thallium stress test, findings will have a sensitivity of 84% and a specificity of 94%. (Normal group: TNF = 94%, FPF = 6%. Heart disease group: TPF = 84%, FNF = 16%. TP = true positive, TN = true negative, FP = false positive, FN = false negative.)

Figure 4. Chart shows the divisions of the population according to the diagnostic criteria used by radiologists. FN = number of examinations with false-negative findings, FP = false-positive findings, TN = true-negative findings, TP = true-positive findings. The four fractions are

$$
\begin{aligned}
\text{TPF} &= \frac{\text{TP}}{\text{TP} + \text{FN}} = \text{sensitivity} \\
\text{TNF} &= \frac{\text{TN}}{\text{TN} + \text{FP}} = \text{specificity} \\
\text{FPF} &= \frac{\text{FP}}{\text{TN} + \text{FP}} = 1 - \text{specificity} \\
\text{FNF} &= \frac{\text{FN}}{\text{TP} + \text{FN}} = 1 - \text{sensitivity}
\end{aligned}
$$

ANATOMY OF AN ROC CURVE

The general form of the ROC curve is shown in Figure 5. In a mathematical sense, an ROC curve is a curvilinear graph that plots the TPF on the y axis as a function of the FPF on the x axis for a range of diagnostic criteria. The TPF equals the percentage of patients with disease who have positive findings, whereas the FPF equals the percentage of patients without disease who have positive findings. A TPF and an FPF are calculated for each diagnostic criterion (range, definitely abnormal to definitely normal) and are plotted to form an ROC curve. A move toward the right on an ROC curve is equivalent to the use of less stringent diagnostic criteria (overreading) for determining the outcome of an examination, thereby increasing sensitivity but decreasing specificity. A move toward the left on an ROC curve, or the use of more strict criteria (underreading), increases specificity but decreases sensitivity of the diagnostic test. The upper right corner of an ROC curve (sensitivity, 100%; specificity, 0%) is formed when all findings are read as abnormal. The lower left corner of an ROC curve (sensitivity, 0%; specificity, 100%) is formed when all findings are read as normal.

Figure 5. Anatomy of an ROC curve (true-positive fraction plotted against false-positive fraction). A move toward the right on an ROC curve is equivalent to overreading the findings, thereby increasing sensitivity but decreasing specificity. A move toward the left (underreading) increases specificity but decreases sensitivity of the diagnostic test.

Because the FPF equals [1 − specificity], reversing the direction of increasing values on the x axis (1.0 at the origin and 0.0 at the right corner) turns the x axis into specificity. The y axis label, TPF, is the same as sensitivity. With these new labels, the plots are called "sensitivity-specificity curves," as proposed by Brismar (6). Except for the labels, the ROC curves are unchanged (Fig 6).

Figure 6. Plots show classic ROC (a) and "new age" ROC (b) curves. The x axis of a conventional ROC curve can be reoriented (1.0 at the origin and 0.0 at the right corner) to create a sensitivity-specificity curve. Except for labels, the ROC curves are unchanged.
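The four fractions of Figure 4 follow directly from the counts of TP, FN, TN, and FP. The short Python sketch below (illustrative only; the function names are hypothetical) computes them and maps an ROC point (FPF, TPF) onto the reoriented sensitivity-specificity axes. The counts assume 100 patients per group so that they reproduce the 84% sensitivity and 94% specificity quoted for the single-criterion thallium example.

```python
def fractions(tp, fn, tn, fp):
    """Return TPF, FNF, TNF and FPF (as in Figure 4) from raw examination counts."""
    with_disease = tp + fn       # everyone in the heart disease group
    without_disease = tn + fp    # everyone in the normal group
    return {
        "TPF": tp / with_disease,      # sensitivity
        "FNF": fn / with_disease,      # 1 - sensitivity
        "TNF": tn / without_disease,   # specificity
        "FPF": fp / without_disease,   # 1 - specificity
    }


def to_sensitivity_specificity(fpf, tpf):
    """Re-express an ROC point (FPF, TPF) as (specificity, sensitivity).

    Reversing the x axis only relabels it: specificity = 1 - FPF,
    and the y axis (TPF) already equals sensitivity.
    """
    return 1.0 - fpf, tpf


# Hypothetical counts: 100 patients with and 100 without heart disease.
f = fractions(tp=84, fn=16, tn=94, fp=6)
print(f)  # {'TPF': 0.84, 'FNF': 0.16, 'TNF': 0.94, 'FPF': 0.06}
spec, sens = to_sensitivity_specificity(f["FPF"], f["TPF"])
print(f"specificity {spec:.2f}, sensitivity {sens:.2f}")  # specificity 0.94, sensitivity 0.84
```

Because TPF + FNF = 1 and TNF + FPF = 1 within each group, a single ROC point carries the same information as the sensitivity-specificity pair; only the axis labels differ.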

CONSTRUCTION OF AN ROC CURVE

To conduct an ROC experiment, one starts with two populations of people (determined by some known "gold" standard): one group with disease and the other group without disease. In the case of imaging modalities, there will be two sets of images for each modality: one set of images with a known radiologic finding and the other without the finding. In the example of a thallium stress test, a radiologist might apply the following continuum of diagnostic criteria to determine the outcome of a patient's examination: (a) definitely abnormal findings show a greater than 50% wall defect, (b) probably abnormal findings show a 30%-50% wall defect, (c) equivocal findings reflect a 10%-30% wall defect, (d) probably normal findings show a 1%-10% wall defect, and (e) definitely normal findings demonstrate a 0% wall defect. The probably abnormal criterion classifies the results of an examination as abnormal if the findings are either probably abnormal or definitely abnormal. Similarly, the equivocal and probably normal criteria each count their own category plus all of the more abnormal categories as abnormal.

In an ROC experiment, a set of images from patients with known diagnoses is read by experts (radiologists) in repeated series. In the first series, only a definitely abnormal criterion is applied to the set of images, from which a TPF and an FPF are extracted. For the second series, both definitely abnormal and probably abnormal criteria are applied, and the TPF and FPF are calculated. This process is repeated for each of the subsequent levels. Finally, the collection of TPF and FPF values is plotted to form an ROC curve. To illustrate, assume the following values for the thallium stress test: (a) definitely abnormal findings have a TPF of 0.57 and an FPF of 0.02 (Fig 7), (b) probably abnormal findings have a TPF of 0.84 and an FPF of 0.06 (Fig 8), (c) equivocal findings have a TPF of 0.90 and an FPF of 0.27 (Fig 9), and (d) probably normal findings have a TPF of 0.95 and an FPF of 0.54 (Fig 10). Values for TPF and FPF are not calculated for the definitely normal criterion (0% wall defect), since the radiologist blindly considers all examination findings abnormal at this level. This point is located at the upper right corner of the ROC curve (TPF, 1.0; FPF, 1.0). Plotting these values produces the ROC curve shown in Figure 11.

Figure 7. Definitely abnormal criterion. Findings are considered abnormal with a greater than 50% wall defect. (Normal group: FPF = 2%. Heart disease group: TPF = 57%.)

Figure 8. Probably abnormal criterion. Findings are considered abnormal if they are either definitely abnormal or probably abnormal (a greater than 30% wall defect). (Normal group: FPF = 6%. Heart disease group: TPF = 84%.)

Figure 9. Equivocal criterion. Findings are considered abnormal if they are definitely abnormal, probably abnormal, or equivocal (a greater than 10% wall defect). (Normal group: FPF = 27%. Heart disease group: TPF = 90%.)

Figure 10. Probably normal criterion. Findings are considered abnormal if they are definitely abnormal, probably abnormal, equivocal, or even probably normal (a greater than 1% wall defect). (Normal group: FPF = 54%. Heart disease group: TPF = 95%.)
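The repeated-series procedure amounts to accumulating counts from the most abnormal reading category downward. The Python sketch below is an illustration under stated assumptions, not the authors' software: the per-category counts are hypothetical (100 patients per group), invented only so that the cumulative fractions match the illustrative TPF and FPF values above, and `roc_points` is a hypothetical helper name.

```python
from itertools import accumulate

# Reading categories, ordered from most to least abnormal (thallium example).
CATEGORIES = ["definitely abnormal", "probably abnormal", "equivocal",
              "probably normal", "definitely normal"]

# Hypothetical counts of readings per category for 100 patients with heart
# disease and 100 without. They are invented so that the cumulative fractions
# reproduce the illustrative TPF/FPF values given in the text.
DISEASED_COUNTS = [57, 27, 6, 5, 5]
HEALTHY_COUNTS = [2, 4, 21, 27, 46]


def roc_points(diseased_counts, healthy_counts):
    """Return (FPF, TPF) points, one per cumulative diagnostic criterion.

    Criterion k reads an examination as abnormal if it falls in categories
    0..k, so the true- and false-positive counts are running totals.
    """
    n_diseased = sum(diseased_counts)
    n_healthy = sum(healthy_counts)
    points = [(0.0, 0.0)]  # lower left corner: every examination read as normal
    for tp, fp in zip(accumulate(diseased_counts), accumulate(healthy_counts)):
        points.append((fp / n_healthy, tp / n_diseased))
    return points


labels = ["(all read normal)"] + CATEGORIES
for name, (fpf, tpf) in zip(labels, roc_points(DISEASED_COUNTS, HEALTHY_COUNTS)):
    print(f"{name:20s} FPF={fpf:.2f} TPF={tpf:.2f}")
```

The final point is the upper right corner (1.0, 1.0), reached when even definitely normal readings are called abnormal, which is why the text does not need to measure it.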

INTERPRETATION OF ROC CURVES

To compare two or more imaging modalities, an ROC curve is produced for each modality (Fig 12). If a radiologist resorts to a system of flipping a coin to make diagnoses (random guessing), 50:50 distributions of true positive/false negative and true negative/false positive are found in the groups with and without disease, respectively (Fig 13). An ROC curve for random guessing consists of a straight diagonal line (curve C). If a perfect diagnostic test (or imaging modality) exists that completely separates two groups of people (with and without disease) without any false-positive or false-negative findings (Fig 14), an ROC curve resembling curve A is created. Sensitivity and specificity equal 100% for all diagnostic criteria. In reality, an ROC curve for a diagnostic test or imaging modality more likely resembles curve B. As an ROC curve approaches the y axis (in the direction of curve A), the test or modality better approximates a perfect test and better enables one to distinguish between patients with disease and those without (finding present or absent on the image), with fewer false-positive and false-negative findings over a continuum of diagnostic criteria. Therefore, the answer to the question of which modality is better is system A in Figure 1.

Figure 11. Plotting TPF and FPF values yields the thallium stress test ROC curve. (Plotted points: FPF = 0.02, TPF = 0.57; FPF = 0.06, TPF = 0.84; FPF = 0.27, TPF = 0.90; FPF = 0.54, TPF = 0.95.)

Figure 12. Curve A represents an ROC curve obtained from a "perfect" diagnostic test, whereas curve C is produced from random guessing. Curve B resembles a more typical ROC curve.
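One simple way to make "closer to the y axis" concrete is to compare the TPF that each curve reaches at the same FPF. The Python sketch below assumes, for illustration only, that measured points are joined by straight lines. The coordinates for random guessing (curve C) and a perfect test (curve A) follow from the descriptions of those curves, the thallium points are the values plotted in Figure 11, and `tpf_at` is a hypothetical helper.

```python
from bisect import bisect_right

# Curves as (FPF, TPF) points sorted by FPF.
RANDOM = [(0.0, 0.0), (1.0, 1.0)]                 # curve C: diagonal line
PERFECT = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]    # curve A: hugs the y axis
THALLIUM = [(0.0, 0.0), (0.02, 0.57), (0.06, 0.84),
            (0.27, 0.90), (0.54, 0.95), (1.0, 1.0)]


def tpf_at(points, fpf):
    """TPF of a piecewise-linear ROC curve at a given FPF in [0, 1].

    Joining points with straight lines is an illustrative assumption.
    When several points share one FPF, the last (highest) one is used.
    """
    xs = [x for x, _ in points]
    i = bisect_right(xs, fpf) - 1          # last point with FPF <= fpf
    x0, y0 = points[i]
    if x0 == fpf or i == len(points) - 1:  # exact hit or right end of the curve
        return y0
    x1, y1 = points[i + 1]                 # x1 > fpf >= x0, so x1 != x0
    return y0 + (fpf - x0) * (y1 - y0) / (x1 - x0)


for fpf in (0.05, 0.10, 0.30):
    print(f"FPF {fpf:.2f}: random {tpf_at(RANDOM, fpf):.2f}, "
          f"thallium {tpf_at(THALLIUM, fpf):.2f}, perfect {tpf_at(PERFECT, fpf):.2f}")
```

At each FPF the thallium curve lies between the diagonal and the perfect test: for the same rate of false-positive findings it detects more disease than guessing, but fewer cases than a perfect test would.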

Figure 13. Random guessing produces 50:50 TPFs/FNFs and TNFs/FPFs in the two populations. (Normal group: FPF = 50%. Heart disease group: TPF = 50%.)

Figure 14. A perfect test completely separates the two groups without any false-positive or false-negative findings. (Normal group: FPF = 0%. Heart disease group: TPF = 100%.)

SUMMARY

Important points regarding ROC curves include the following: (a) An ROC curve is generated by plotting sensitivity on the y axis as a function of [1 − specificity] on the x axis for a continuum of diagnostic criteria. If the name is confusing, the x axis can be reoriented and an ROC curve thought of as a sensitivity-specificity curve. (b) A move toward the right on an ROC curve is equivalent to using less stringent diagnostic criteria (overreading), thereby increasing sensitivity but decreasing specificity. A move toward the left (underreading) increases specificity but decreases sensitivity of the diagnostic test. (c) The ROC curve closest to the y axis represents the best diagnostic test because it approaches a perfect test, with fewer false-positive and false-negative findings.

REFERENCES

1. Goodenough DJ, Rossmann K, Lusted LB. Radiographic applications of receiver operating characteristic (ROC) curves. Radiology 1974; 110:89-95.
2. Lusted LB. Logical analysis in roentgen diagnosis. Radiology 1960; 74:178-193.
3. Metz CE. ROC methodology in radiologic imaging. Invest Radiol 1986; 21:720-733.
4. Turner DA. An intuitive approach to receiver operating characteristic curve analysis. J Nucl Med 1978; 19:213-220.
5. McNeil BJ. Case 15: decision analysis I. In: Siegel BA, ed. Nuclear radiology test and syllabus. Reston, Va: American College of Radiology, 1990; 299-307.
6. Brismar J. Understanding receiver-operating-characteristic curves: a graphic approach. AJR 1991; 157:1119-1121.
