Int J Biomed Comput. 31 (1992) 117-126


Elsevier Scientific Publishers Ireland Ltd.

MAXIMUM LIKELIHOOD ESTIMATION OF SIGNAL DETECTION MODEL PARAMETERS FOR THE ASSESSMENT OF TWO-STAGE DIAGNOSTIC STRATEGIES

ROLANDO BISCAY LIRIO, ISMAEL CLARK DONDERIZ and M.C. PEREZ ABALO

Cuban Neuroscience Center, POB 6880, Havana (Cuba)

(Received February 11th, 1992) (Accepted March 26th, 1992)

The methodology of Receiver Operating Characteristic curves based on the signal detection model is extended to evaluate the accuracy of two-stage diagnostic strategies. A computer program is developed for the maximum likelihood estimation of parameters that characterize the sensitivity and specificity of two-stage classifiers according to this extended methodology. Its use is briefly illustrated with data collected in a two-stage screening for auditory defects.

Keywords: Diagnostic accuracy; Signal detection theory; Multistage classifier; Maximum likelihood estimation

Introduction

The methodology of ROC (Receiver Operating Characteristic) curves based on the signal detection model has been used extensively for several years in the assessment of diagnostic procedures [1-4]. The classical methodology assumes a two-alternative diagnostic situation, i.e. each subject is classified as either abnormal or normal. In this case a false positive (respectively, true positive) classification results when a normal (abnormal) subject is assigned to the abnormal population. It is also assumed that the classification is based on the comparison of a test statistic $T$ with a decision threshold $C$ (a positive diagnosis is adopted if $T > C$). The threshold $C$ controls the bias of the classifier, and higher values of $C$ reflect more bias towards negative responses. The ROC curve is defined as the plot $(\alpha(C), \beta(C))$ of the probabilities of false positive ($\alpha(C)$) and true positive ($\beta(C)$) classifications as functions of the decision threshold $C$. The main features that support the use of the ROC methodology are:

1. ROC curves make it possible to study the dependence between specificity ($\alpha(C)$) and sensitivity ($\beta(C)$) over the full range of the classifier's decision threshold $C$.
2. The area under a ROC curve is a threshold-independent index of diagnostic accuracy.

Correspondence to: Rolando Biscay Lirio, Centro Nacional de Investigaciones Cientificas, Cuban Neuroscience Center, Apartado 6880, Ciudad de la Habana, Cuba.

0020-7101/92/$05.00 © 1992 Elsevier Scientific Publishers Ireland Ltd. Printed and Published in Ireland


3. Given test samples of the positive and negative populations, it is feasible to obtain an empirical ROC curve, with the classifier being either automatic or subjective (expert). An empirical ROC curve is formed by the observed fractions of false positive and true positive classifications $(\alpha(C_k), \beta(C_k))$ $(k = 1, \ldots, K)$ corresponding to a given number $K$ of different values of the decision threshold. Specifically, the two methods most used to collect ROC data are the yes-no [5] and the rating [6] methods.
4. A parametric statistical model, known as the signal detection model, has been proposed for ROC data. This model is based upon general and plausible statistical assumptions, and acceptable goodness of fit has been reported for a great variety of practical situations [1-3]. Computer algorithms are available for the maximum likelihood estimation of its parameters [5-6]. Once the parameters are estimated, any point of the ROC curve corresponding to an arbitrary value of the threshold $C$ can be easily calculated, as well as the area under the curve.

In many practical situations, particularly in biomedicine, the solution of complex classification tasks requires the use of multistage strategies, formed by decision trees of classifiers (for example see Ref. 7). Because the classifier at each node of the tree has its own decision threshold, the classical (one-stage) ROC methodology described above is applicable at each stage, but cannot be applied to the resulting full classifier. An outstanding practical example of this situation is a two-stage screening program for the detection of subjects belonging to a specific pathological population (for example see Ref. 8). Subjects classified as negative by the first classifier ($R_1$) are definitively labeled as negative, and subjects regarded as positive by $R_1$ are subsequently analyzed by the second classifier ($R_2$).
Two-stage screening strategies are usually designed to achieve both acceptable diagnostic accuracy and low cost (in terms of the material and human resources involved). The first and second classifiers are usually selected as having very low cost and high accuracy, respectively. The fact that only the subjects classified as positive by $R_1$ are examined by $R_2$ decreases the cost (in comparison with the cost that would result if the whole population were evaluated by $R_2$). If a sufficiently low value of the decision threshold of $R_1$ is chosen, only a small fraction of false negative classifications will be obtained in the first stage, and so the further application of $R_2$ may yield an acceptable accuracy.

In this paper the ROC methodology is extended by the introduction of a bivariate signal detection model to evaluate the accuracy of two-stage classifiers. An algorithm is described for the maximum likelihood estimation of the parameters of this model from two-stage ROC data. Using the estimated parameters, a family of two-stage ROC curves and a threshold-independent index of accuracy are calculated. An application of these methods to data collected in a two-stage screening for detecting auditory defects is briefly presented.

Two-stage ROC Methodology

Two-stage ROC curves

Consider the problem of the classification of a subject into one of two populations, $\Pi^0$ (normal or negative population) or $\Pi^1$ (abnormal or positive population), based on a random vector $X$ of diagnostic features. Labels '0' and '1' will be used to denote the two possible decisions of a classifier (i.e. negative and positive responses). A two-stage classifier $R$, based on the classifiers $R_1$ and $R_2$, is defined by the following strategy of successive decisions:
(i) (first stage) if the first classifier has a negative response ($R_1 = 0$), then $R = 0$; if $R_1 = 1$, then go to step (ii);
(ii) (second stage) if the second classifier also has a positive response ($R_2 = 1$), then $R = 1$; if $R_2 = 0$, then $R = 0$.

Let $C_i$ be the decision threshold of the classifier $R_i$ $(i = 1, 2)$. Let $\alpha_i$ and $\beta_i$ be the probabilities of false positive and true positive classification for $R_i$. Thus, $(\alpha_i(C_i), \beta_i(C_i))$ is the point of the ROC curve of $R_i$ corresponding to the value $C_i$ of the threshold. The probabilities of false positive and true positive classifications for $R$ are functions of both thresholds $C_1$ and $C_2$, say $\alpha(C_1, C_2)$ and $\beta(C_1, C_2)$. These functions are surfaces that characterize the specificity and sensitivity of the classifier $R$, respectively. In order to study the tradeoff between specificity and sensitivity for the two-stage classifier it is necessary to match the surfaces $\alpha$ and $\beta$ over the region of interest. The following types of two-stage ROC curves are introduced as tools for this purpose.

(i) $\alpha_1$-ROC curve. This is the curve $(\alpha(C_{10}, C_2), \beta(C_{10}, C_2))$ as a function of $C_2$, where $C_{10}$ is a given value of $C_1$. This curve makes it possible to study the relation between the surfaces $\alpha$ and $\beta$ when the decision threshold of $R_2$ varies over its full range but the classifier $R_1$ is held at a given specificity value $\alpha_{10}$ (corresponding to $C_{10}$). Note that the practical importance of fixing $\alpha_1$ is that it controls the cost of the two-stage classification, because subjects that obtain false positive classifications by $R_1$ undergo an additional (usually expensive) examination by $R_2$.

(ii) $\alpha$-ROC curve.
This is the curve $(\alpha_1(C_1), \beta(C_1, C_2(C_1, \alpha_0)))$ as a function of $C_1$, where $\alpha_0$ is a given value of $\alpha$ and $C_2(C_1, \alpha_0)$ is the value of $C_2$ such that $\alpha(C_1, C_2) = \alpha_0$. This curve represents the relation between the specificity $\alpha_1$ of $R_1$ and the sensitivity $\beta$ of $R$ under the constraint that the specificity $\alpha$ of $R$ is a given value $\alpha_0$. Note that a small given value of the specificity $\alpha$ is a necessary constraint imposed on a diagnostic strategy when subjects regarded as positive are subjected to drastic therapeutic treatments.

(iii) $\beta$-ROC curve. This is the curve $(\alpha_1(C_1), \alpha(C_1, C_2(C_1, \beta_0)))$ as a function of $C_1$, where $\beta_0$ is a given value of $\beta$ and $C_2(C_1, \beta_0)$ is the value of $C_2$ such that $\beta(C_1, C_2) = \beta_0$. This curve shows the relation between the specificities $\alpha_1$ and $\alpha$ under the constraint that the sensitivity $\beta$ of $R$ is a given value. This constraint forces the classifier to detect at least a fraction $\beta_0$ of abnormal subjects.

In analogy with classical ROC methodology, a threshold-independent index of accuracy of a two-stage classifier is

$$A = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \beta(C_1, C_2) \, d\alpha(C_1, C_2) \tag{1}$$

where the (double) integration is in the Riemann-Stieltjes sense.


This index has a simple geometric interpretation. Consider the surface $\beta(\alpha_1, \alpha_{2/1})$ defined as the graph of the function $\beta(\alpha_1(C_1), \alpha_2(C_2/C_1))$, where $\alpha_2(C_2/C_1)$ is the probability of false positive classification by $R_2$ using the threshold $C_2$, conditional on the event that $R_1$ gives a false positive classification using the threshold $C_1$. It can be shown that the index $A$ is equal to the volume of the region under this surface.

Bivariate signal detection model

Denote by $T_i$ the test statistic of the classifier $R_i$ associated with the $i$-th stage of the classifier $R$. Then:

$$\alpha(C_1, C_2) = P^0(T_1 > C_1 \text{ and } T_2 > C_2) \tag{2}$$

$$\beta(C_1, C_2) = P^1(T_1 > C_1 \text{ and } T_2 > C_2) \tag{3}$$

where $P^0$ and $P^1$ are the probability measures associated with the populations $\Pi^0$ and $\Pi^1$, respectively. We define the two-stage signal detection model by the assumption that the random vector $T = (T_1, T_2)$ has a bivariate Gaussian distribution $N_2(\mu_1^i, \mu_2^i, \Sigma^i)$ under $\Pi^i$ $(i = 0, 1)$. The square roots of the diagonal elements of $\Sigma^i$ (standard deviations) will be denoted by $\sigma_j^i$ $(j = 1, 2)$. By standardization of the random variables $T_1$ and $T_2$ in (2)-(3) we obtain

$$\alpha(C_1, C_2) = P^0(Z_1 > \xi_1 \text{ and } Z_2 > \xi_2) \tag{4}$$

$$\beta(C_1, C_2) = P^1(Z_1 > b_1\xi_1 - a_1 \text{ and } Z_2 > b_2\xi_2 - a_2) \tag{5}$$

where the random vector $Z = (Z_1, Z_2)$ has a bivariate Gaussian distribution with zero mean, unit standard deviations and correlation $\rho^i$ under the population $\Pi^i$ $(i = 0, 1)$; $\xi_j = (C_j - \mu_j^0)/\sigma_j^0$, $b_j = \sigma_j^0/\sigma_j^1$ and $a_j = (\mu_j^1 - \mu_j^0)/\sigma_j^1$ for $j = 1, 2$. Note that the two-stage signal detection model (4)-(5) is equivalent to the classical signal detection model for the classifier of each stage. However, additional parameters $\rho^0$ and $\rho^1$ are included because of possible statistical dependence between the decisions of the two stages.

Maximum Likelihood Algorithm

An algorithm for the maximum likelihood estimation of the bivariate signal detection model will be described in this section, assuming ROC data are obtained by the rating method (a similar extension can be made for ROC yes-no data, as defined in [5]). The rating-method data for one-stage classifiers [6] are collected as follows.
(i) An ordinal finite set of diagnostic labels is specified which represent increasing levels of certainty of positive classification (e.g. 'not positive', 'uncertain', 'positive').
(ii) Given samples of the normal and abnormal populations, the classifier is required to assign one label of this set to each tested subject.
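As a hedged illustration of how rating-method data for a single classifier yield an empirical ROC curve, the sketch below cumulates the counts from the most certain 'positive' label downward, so that each boundary between adjacent labels acts as one decision threshold. The function name `empirical_roc` and the counts are hypothetical and are not taken from the paper.

```python
def empirical_roc(normal_counts, abnormal_counts):
    """Empirical ROC points from rating-method counts of one classifier.

    Both lists are ordered by increasing certainty of a positive
    classification (e.g. 'not positive', 'uncertain', 'positive').
    Returns one (false positive fraction, true positive fraction) pair
    per boundary between adjacent labels, strictest threshold first.
    """
    n0, n1 = sum(normal_counts), sum(abnormal_counts)
    points = []
    false_pos = true_pos = 0
    # Walk from the most certain label down; the lowest label is skipped
    # because calling everything positive gives the trivial point (1, 1).
    for c0, c1 in zip(reversed(normal_counts[1:]), reversed(abnormal_counts[1:])):
        false_pos += c0
        true_pos += c1
        points.append((false_pos / n0, true_pos / n1))
    return points


# Hypothetical counts for three labels (illustration only).
roc_points = empirical_roc([80, 15, 5], [10, 20, 70])
print(roc_points)
```

With three labels the sketch returns two points, one per internal threshold; these are the observed fractions to which a one-stage signal detection model would be fitted.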


The application of the rating method to a two-stage classifier $R$ requires the specification of a set of levels of certainty $(Q_{i1}, Q_{i2}, \ldots, Q_{ik_i})$ for each component classifier $R_i$ $(i = 1, 2)$. Following the classical modeling for rating-method data [6], we assume that there is a set of threshold values $(\xi_{i1}, \xi_{i2}, \ldots, \xi_{im_i})$ of the decision threshold $C_i$ of the classifier $R_i$ $(i = 1, 2)$, such that $R_i$ assigns the level of certainty $Q_{ij}$ to a subject if and only if the value of the test statistic $T_i$ satisfies $\xi_{i,j-1} < T_i \le \xi_{ij}$, where $j = 1, \ldots, k_i$, $k_i = m_i + 1$, $\xi_{i,0} = -\infty$ and $\xi_{i,k_i} = \infty$ for $i = 1, 2$. Given a sample of size $n^i$ from the population $\Pi^i$, let $n_{jk}^i$ be the number of subjects classified as $Q_{1j}$ by $R_1$ and as $Q_{2k}$ by $R_2$ $(j = 1, \ldots, k_1;\ k = 1, \ldots, k_2;\ i = 0, 1)$. Then the vector $(n_{jk}^i: j = 1, \ldots, k_1,\ k = 1, \ldots, k_2)$ follows a multinomial distribution with sample size $n^i$ and a vector of probabilities that will be denoted by $(P_{jk}^i: j = 1, \ldots, k_1,\ k = 1, \ldots, k_2)$. Thus, up to an additive constant that does not depend on the parameters, the log-likelihood is:

$$L = \sum_{i=0}^{1} \sum_{j=1}^{k_1} \sum_{k=1}^{k_2} n_{jk}^i \log P_{jk}^i \tag{6}$$

According to the bivariate signal detection model (4)-(5), the probabilities $P_{jk}^i$ are determined by the parameters of the model, $a_j$, $b_j$, $(\xi_{j1}, \xi_{j2}, \ldots, \xi_{jm_j})$ $(j = 1, 2)$, $\rho^0$ and $\rho^1$ (see Appendix). The derivative of the log-likelihood with respect to a parameter $\phi$ is:

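$$\frac{\partial L}{\partial \phi} = \sum_{i=0}^{1} \sum_{j=1}^{k_1} \sum_{k=1}^{k_2} \frac{n_{jk}^i}{P_{jk}^i} \frac{\partial P_{jk}^i}{\partial \phi} \tag{7}$$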
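The multinomial log-likelihood and its derivative translate almost directly into code. The following minimal sketch assumes the cell probabilities and their derivatives with respect to one parameter have already been computed; the function names `log_likelihood` and `score`, the nested-list layout `counts[i][j][k]`, and the numbers are all hypothetical, chosen only for illustration.

```python
import math


def log_likelihood(counts, probs):
    """Sum over populations i and rating cells (j, k) of n * log(P)."""
    total = 0.0
    for n_pop, p_pop in zip(counts, probs):          # populations i = 0, 1
        for n_row, p_row in zip(n_pop, p_pop):       # first-stage rating j
            for n, p in zip(n_row, p_row):           # second-stage rating k
                if n > 0:                            # empty cells add nothing
                    total += n * math.log(p)
    return total


def score(counts, probs, dprobs):
    """dL/dphi for one parameter, given dP/dphi for every cell."""
    total = 0.0
    for n_pop, p_pop, d_pop in zip(counts, probs, dprobs):
        for n_row, p_row, d_row in zip(n_pop, p_pop, d_pop):
            for n, p, d in zip(n_row, p_row, d_row):
                if n > 0:
                    total += n / p * d
    return total


# Hypothetical 2 x 2 rating tables for the normal and abnormal samples.
counts = [[[8, 2], [1, 1]], [[1, 1], [2, 6]]]
probs = [[[0.70, 0.20], [0.05, 0.05]], [[0.10, 0.10], [0.20, 0.60]]]
print(log_likelihood(counts, probs))
```

A full gradient is obtained by calling `score` once per parameter with the corresponding table of derivatives.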

Maximization of the likelihood with respect to the parameters may be accomplished by setting the first partial derivatives (7) equal to zero and solving the resulting set of nonlinear equations. The solution may be obtained by an adaptation of the Newton-Raphson method known as the method of scoring [9]. This method computes an improved vector of estimates $\Phi_{t+1}$ at each iteration $(t = 0, 1, 2, \ldots)$ according to:

$$\Phi_{t+1} = \Phi_t + I_t^{-1} g_t \tag{8}$$

where $g_t$ is the gradient of $L$ (i.e. the vector of first partial derivatives (7)) and $I_t$ is Fisher's information matrix (i.e. the matrix of the expected values of the negative second partial derivatives, $E(-\partial^2 L/\partial\phi\,\partial\phi')$), both evaluated at the current value $\Phi_t$ of the vector of unknowns. Detailed formulae for the calculation of the probabilities $P_{jk}^i$, the gradient $g$ and the second derivatives of $L$ are given in the Appendix. They are used in the following algorithm for maximum likelihood estimation based on the scoring method:

(a) Set $t = 0$.
(b) Obtain an initial value $\Phi_0$ by means of:
(i) the maximization of the marginal (one-stage) likelihood of the classifier $R_j$ with respect to its parameters $a_j$, $b_j$, $\xi_{j1}, \xi_{j2}, \ldots, \xi_{jm_j}$, using the algorithm described in [6], for each $j = 1, 2$;
(ii) the maximization of $L$ with respect to $\rho^0$ and $\rho^1$, fixing the values of the parameters obtained in (i). Since only an approximate initial value is desired and since admissible values of $\rho^i$ are in the bounded interval $[-1, 1]$, a search method is suggested, e.g. Brent's method in [10].
(c) Calculate the value $L_0$ of the log-likelihood at $\Phi_0$, substituting $P_{jk}^i$ in (6) by its values computed as described in the Appendix.
(1) Compute the gradient vector $g_t$ at $\Phi_t$ as described in the Appendix.
(2) Compute the information matrix $I_t$ at $\Phi_t$ as described in the Appendix.
(3) Update $\Phi_{t+1}$ according to Eqn. 8.
(4) Calculate the value $L_{t+1}$ of the log-likelihood at $\Phi_{t+1}$, replacing $P_{jk}^i$ in (6) by its values computed as described in the Appendix.
(5) Repeat steps 1-4 until convergence (i.e. until the relative difference between $L_{t+1}$ and $L_t$ is less than a previously specified precision $\delta$).

After convergence, the diagonal elements of $I_t^{-1}$ are the asymptotic variances of the estimates.

Illustration of Program Output

In this section the practical use of the algorithm described above is briefly illustrated with data collected in a two-stage screening for auditory defects. The first-stage decision ($R_1$) is based on a risk index that summarizes some risk factors for auditory defects in the pre- or perinatal period. The second stage ($R_2$) consists of the evaluation of brain electric activity (Auditory Evoked Potentials in the Brainstem) by a team of experts using a computer system. The latter diagnostic technology is much more expensive than the first classifier, but allows the detection of auditory defects with a higher accuracy even in newborns. The sample (27 087 infants) was provided by several obstetric-pediatric hospitals in Havana City. Clinical follow-up of cases (audiological tests) showed that the sample includes $n^0 = 27\,000$ normal and $n^1 = 87$ abnormal subjects. For further details see Ref. 8. The total sample was evaluated by the two classifiers $R_1$ and $R_2$. It was desired to analyze the performance of the two-stage strategy on this preliminary data set in order to aid in the selection of convenient decision thresholds (according to cost and accuracy) to be used in the two-stage screening of future populations. Results of the maximum likelihood estimation algorithm are presented in Table I. As expected, the classifier $R_2$ shows very high accuracy (area under ROC curve, $A_2 = 0.9994$). This is also evident from the marginal ROC curves (Fig. 1). Risk factors tend to be associated with positive brain electric findings, as reflected by positive estimates of the correlations between the classifiers $R_1$ and $R_2$ under both populations ($\rho^0 = 0.4238$ and $\rho^1 = 0.1548$ in Table I). This fact cannot be taken into consideration by the classical (marginal) ROC analysis. Indeed, the extended ROC methodology is necessary for the estimation of both a threshold-independent index of accuracy of the two-stage screening ($A = 0.9526$) and its two-stage ROC curves.
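The marginal areas quoted in this section can be cross-checked against the binormal parameters reported in Table I. For a one-stage signal detection model with parameters $a$ and $b$, a standard result gives the area under the ROC curve as $\Phi(a/\sqrt{1 + b^2})$, where $\Phi$ is the standard normal distribution function. The sketch below applies that formula to the final estimates; the helper name `marginal_area` is hypothetical, and the formula is the textbook binormal result rather than something taken from the program described here.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def marginal_area(a, b):
    """Area under a one-stage binormal ROC curve: Phi(a / sqrt(1 + b^2))."""
    return STD.cdf(a / math.sqrt(1.0 + b * b))


# Final estimates quoted from Table I.
a1, a2 = 6.5513, 8.4226
b1, b2 = 3.7782, 2.4094

area_1 = marginal_area(a1, b1)
area_2 = marginal_area(a2, b2)
print(f"A1 = {area_1:.4f}, A2 = {area_2:.4f}")
```

The two values can be compared with the rows 'Area A1' and 'Area A2' of Table I.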


TABLE I

Parameters     Initial     Final (a)
$a_1$          6.4593      6.5513 (0.0051)
$a_2$          8.4985      8.4226 (0.0057)
$b_1$          3.9255      3.7782 (0.0050)
$b_2$          2.2251      2.4094 (0.0036)
$\rho^0$       0.3958      0.4238 (0.0050)
$\rho^1$       0.1535      0.1548 (0.0057)
Likelihood     -0.4763     -0.4751
Area $A_1$                 0.9532
Area $A_2$                 0.9994
Volume $A$                 0.9526

(a) Standard deviations are between parentheses. The likelihood is divided by the total sample size.
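To show how the fitted parameters in Table I map to an operating point of the two-stage classifier through Eqns. (4)-(5), the sketch below fixes the first-stage false positive probability at $\alpha_1 = 0.10$, solves by bisection for the second-stage threshold that gives an overall $\alpha = 0.05$, and then evaluates $\beta$. This is only an illustration under stated assumptions: the bivariate Gaussian probabilities are obtained by one-dimensional numerical integration (Simpson's rule) rather than by the Hermite expansion of the Appendix, correlations are assumed to lie strictly inside $(-1, 1)$, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal: pdf, cdf, inv_cdf


def upper_orthant(h, k, rho, n=400, z_max=10.0):
    """P(Z1 > h and Z2 > k) for a standard bivariate normal, |rho| < 1.

    Integrates pdf(z) * P(Z2 > k | Z1 = z) over z in [h, z_max] with
    composite Simpson's rule (n must be even).
    """
    lo = max(h, -z_max)
    if lo >= z_max:
        return 0.0
    s = math.sqrt(1.0 - rho * rho)

    def integrand(z):
        return STD.pdf(z) * STD.cdf((rho * z - k) / s)

    step = (z_max - lo) / n
    total = integrand(lo) + integrand(z_max)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * integrand(lo + i * step)
    return total * step / 3.0


def two_stage_rates(xi1, xi2, a, b, rho0, rho1):
    """(alpha, beta) of the two-stage classifier, following Eqns. (4)-(5)."""
    alpha = upper_orthant(xi1, xi2, rho0)
    beta = upper_orthant(b[0] * xi1 - a[0], b[1] * xi2 - a[1], rho1)
    return alpha, beta


def alpha_roc_point(alpha1, alpha0, a, b, rho0, rho1):
    """Point of the alpha-ROC curve: stage 1 at alpha1, overall alpha0 < alpha1."""
    xi1 = STD.inv_cdf(1.0 - alpha1)      # standardized first-stage threshold
    lo, hi = -10.0, 10.0                 # bracket for the second threshold
    for _ in range(60):                  # bisection: alpha decreases in xi2
        mid = 0.5 * (lo + hi)
        if upper_orthant(xi1, mid, rho0) > alpha0:
            lo = mid
        else:
            hi = mid
    return two_stage_rates(xi1, 0.5 * (lo + hi), a, b, rho0, rho1)


# Final estimates quoted from Table I.
a, b = (6.5513, 8.4226), (3.7782, 2.4094)
alpha, beta = alpha_roc_point(0.10, 0.05, a, b, rho0=0.4238, rho1=0.1548)
print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
```

Repeating the call over a grid of `alpha1` values traces an $\alpha$-ROC curve; the resulting sensitivity can be compared with the value reported for this operating point in the text.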

As a typical illustration, suppose that the screening $R$ is required to have no more than a 5% probability of false positive error, and that it is also desired, under this constraint, to study the dependence between the specificity of $R_1$ (which controls the cost of the screening through the number of cases to be evaluated by $R_2$) and the sensitivity of $R$. This dependence is represented by the $\alpha$-ROC curve. It is observed, for example, that under the constraint $\alpha = 0.05$, a sensitivity $\beta = 0.9563$ is achieved when the specificity of the first stage is $\alpha_1 = 0.1000$ (Fig. 2). Note that this means that 90% of the normal cases will be (correctly) classified as negative by $R_1$ without requiring the evaluation by $R_2$. Thus, the cost of the screening is substantially decreased while its accuracy is kept rather high ($\alpha = 0.05$, $\beta = 0.96$).

Fig. 1. ROC data (points) and ROC curves of the first (lower) and second (upper) classifiers of a two-stage screening for auditory defects.

Fig. 2. $\alpha$-ROC curve for the two-stage screening under the constraint: probability of false positive classification $\alpha = 0.05$ (see text for definition). The point (alpha, beta) = $(\alpha_1(C_1), \beta(C_1, C_2(C_1, \alpha_0)))$ represents the intersection of the vertical line (moving cursor) with the $\alpha$-ROC curve (readout in the figure: ALPHA = 0.1000, BETA = 0.9563).

Conclusions

The ROC methodology may be extended to the case of the two-stage classifiers by introducing the concepts of two-stage ROC curves, a threshold-independent index of two-stage accuracy and a bivariate signal detection model. The latter permits the parametric statistical description of two-stage ROC data. An algorithm is developed for the maximum likelihood estimation of this model. Computational feasibility and main features of two-stage ROC methods are briefly illustrated with data of a screening program for auditory defects.

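Appendix

Computation of probabilities $P_{jk}^i$

For $i = 0, 1$, $j = 1, \ldots, k_1$ and $k = 1, \ldots, k_2$:

$$P_{jk}^i = F^i(b_1^i\xi_{1j} - a_1^i,\ b_2^i\xi_{2k} - a_2^i) - F^i(b_1^i\xi_{1,j-1} - a_1^i,\ b_2^i\xi_{2k} - a_2^i) - F^i(b_1^i\xi_{1j} - a_1^i,\ b_2^i\xi_{2,k-1} - a_2^i) + F^i(b_1^i\xi_{1,j-1} - a_1^i,\ b_2^i\xi_{2,k-1} - a_2^i) \tag{A1}$$

Here $F^i(x, y) = F_{\rho^i}(x, y)$ is the standardized bivariate Gaussian distribution function with correlation $\rho^i$, which may be computed using the expansion [11]:

$$F_\rho(x, y) = \sum_{t=0}^{\infty} K_t(x) K_t(y) \rho^t \tag{A2}$$

with

$$K_0(x) = F(x), \qquad K_t(x) = \frac{f(x) H_{t-1}(x)}{\sqrt{t!}} \quad (t \ge 1)$$

where $F$ and $f$ with a single argument denote the univariate standard Gaussian distribution function and density, and $H_t$ is the Hermite polynomial of degree $t$ (in the convention $H_0(x) = 1$, $H_1(x) = x$).

Computation of the gradient vector g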

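Differentiating (A1) we obtain

$$\frac{\partial P_{jk}^i}{\partial\phi} = \frac{\partial F^i}{\partial\phi}(b_1^i\xi_{1j} - a_1^i,\ b_2^i\xi_{2k} - a_2^i) - \frac{\partial F^i}{\partial\phi}(b_1^i\xi_{1,j-1} - a_1^i,\ b_2^i\xi_{2k} - a_2^i) - \frac{\partial F^i}{\partial\phi}(b_1^i\xi_{1j} - a_1^i,\ b_2^i\xi_{2,k-1} - a_2^i) + \frac{\partial F^i}{\partial\phi}(b_1^i\xi_{1,j-1} - a_1^i,\ b_2^i\xi_{2,k-1} - a_2^i) \tag{A3}$$

where $a_j^0 = 0$, $a_j^1 = a_j$, $b_j^0 = 1$ and $b_j^1 = b_j$ $(j = 1, 2)$. Thus, the calculation of $\partial P_{jk}^i/\partial\phi$ consists in calculating derivatives of the general form

$$\frac{\partial F_{jk}^i}{\partial\phi} \tag{A4}$$

where $F_{jk}^i = F^i(b_1^i\xi_{1j} - a_1^i,\ b_2^i\xi_{2k} - a_2^i)$. All derivatives are zero with the exception of the following (for $i = 0$ only the derivatives with respect to the thresholds and $\rho^0$ are non-zero, since $a_j^0$ and $b_j^0$ are constants). Below, $f$ and $F$ with a single argument denote the univariate standard Gaussian density and distribution function.

For $\phi = \rho^i$:

$$\frac{\partial F_{jk}^i}{\partial\rho^i} = f_{\rho^i}(b_1^i\xi_{1j} - a_1^i,\ b_2^i\xi_{2k} - a_2^i) \tag{A5}$$

where $f_\rho$ is the standardized bivariate Gaussian density

$$f_\rho(x, y) = \frac{1}{2\pi (1 - \rho^2)^{1/2}} \exp\left\{-\frac{x^2 - 2\rho x y + y^2}{2(1 - \rho^2)}\right\} \tag{A6}$$

For $\phi = a_1, \xi_{1j}, b_1$:

$$\frac{\partial F_{jk}^i}{\partial\phi} = f(b_1^i\xi_{1j} - a_1^i)\, F\left(\frac{b_2^i\xi_{2k} - a_2^i - \rho^i (b_1^i\xi_{1j} - a_1^i)}{(1 - (\rho^i)^2)^{1/2}}\right) d(\phi) \tag{A7}$$

where $d(\phi) = b_1^i, \xi_{1j}, -1$ for $\phi = \xi_{1j}, b_1, a_1$, respectively.

For $\phi = a_2, \xi_{2k}, b_2$:

$$\frac{\partial F_{jk}^i}{\partial\phi} = f(b_2^i\xi_{2k} - a_2^i)\, F\left(\frac{b_1^i\xi_{1j} - a_1^i - \rho^i (b_2^i\xi_{2k} - a_2^i)}{(1 - (\rho^i)^2)^{1/2}}\right) d(\phi) \tag{A8}$$

where $d(\phi) = b_2^i, \xi_{2k}, -1$ for $\phi = \xi_{2k}, b_2, a_2$, respectively. Substituting $P_{jk}^i$ and $\partial P_{jk}^i/\partial\phi$ in (7) by their values calculated according to (A1) and (A3)-(A8), respectively, we obtain the components $\partial L/\partial\phi$ of the gradient vector $g$.

Computation of the information matrix I

Based on the relation $E(-\partial^2 L/\partial\phi\,\partial\phi') = E(\partial L/\partial\phi \cdot \partial L/\partial\phi')$ and (7), we obtain: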

$$E\left(\frac{\partial L}{\partial\phi} \frac{\partial L}{\partial\phi'}\right) = \sum \frac{1}{P_{jk}^i P_{j'k'}^{i'}} \frac{\partial P_{jk}^i}{\partial\phi} \frac{\partial P_{j'k'}^{i'}}{\partial\phi'} E(n_{jk}^i n_{j'k'}^{i'}) \tag{A9}$$

where summation is performed over $i, i' \in \{0, 1\}$, $j, j' \in \{1, \ldots, k_1\}$ and $k, k' \in \{1, \ldots, k_2\}$. Since the vectors $(n_{jk}^0: j = 1, \ldots, k_1,\ k = 1, \ldots, k_2)$ and $(n_{jk}^1: j = 1, \ldots, k_1,\ k = 1, \ldots, k_2)$ have independent multinomial distributions, it follows that

$$E(n_{jk}^0 n_{j'k'}^1) = n^0 P_{jk}^0\, n^1 P_{j'k'}^1 \tag{A10}$$

and

$$E(n_{jk}^i n_{j'k'}^i) = n^i P_{jk}^i \delta_{jj'} \delta_{kk'} + n^i (n^i - 1) P_{jk}^i P_{j'k'}^i \tag{A11}$$

where $\delta_{jk}$ represents the Kronecker $\delta$ symbol. Substituting the first partial derivatives, (A10) and (A11) into (A9), we obtain the expected values of the second order partial derivatives of $L$.

References

1 Swets JA, Pickett RM, Whitehead DJ, Getty DJ, Schnur JA, Swets JB and Freeman BA: Assessment of diagnostic technologies, Science, 205 (1979) 753-764.
2 Swets JA and Pickett RM: Evaluation of Diagnostic Systems: Methods from Signal Detection Theory, Academic Press, New York, 1982.
3 Swets JA: Measuring the accuracy of diagnostic systems, Science, 240 (1988) 1285-1293.
4 Massof RW and Emmel TC: Criterion-free parameter-free distribution-independent index of diagnostic test performance, Appl Optics, 26 (1987) 1395-1408.
5 Dorfman DD and Alf EJ: Maximum likelihood estimation of parameters of signal detection theory - a direct solution, Psychometrika, 33 (1968) 117-124.
6 Dorfman DD and Alf EJ: Maximum-likelihood estimation of parameters of signal-detection theory and determination of confidence intervals - rating-method data, J Math Psychol, 6 (1969) 487-496.
7 Kurzynski M: On the identity of optimal strategies for multistage classifiers, Pattern Recognition Lett, 10 (1989) 39-46.
8 Perez Abalo MC, Perera M, Babes MA, Valdés M and Sanchez: Ensayo de pesquisaje de defectos auditivos en la ciudad de La Habana, Rev Cubana Invest Bioméd, 7 (1988) 60-74.
9 Rao CR: Linear Statistical Inference and its Applications, John Wiley & Sons, New York (1973).
10 Press W, Flannery B, Teukolsky S, Vetterling W: Numerical Recipes in Pascal, Cambridge University Press, Cambridge (1989).
11 Cramer H: Mathematical Methods of Statistics, Almqvist & Wiksell, Stockholm, 1981.
