
Printed in Great Britain

A note on the likelihood equation in the ABO blood group system

BY JORGEN GRANFELDT PEDERSEN
Matematisk Institut, Aarhus University, DK-8000 Aarhus C, Denmark

NOTATION AND OUTLINE OF PROOF

Let $p$, $q$, and $r$ denote the frequencies of the A, B, and O genes in the population, and let $x = (x_{AA}, x_{AO}, x_{BB}, x_{BO}, x_{AB}, x_{OO})$ denote the numbers of genotypes AA, AO, BB, BO, AB, OO, respectively, in a sample of $n$ blood types. Then the observations $y$ of the four blood groups A, B, AB, and O are

$$y = (x_{AA} + x_{AO},\; x_{BB} + x_{BO},\; x_{AB},\; x_{OO}) = n(a, b, \bar{x}_{AB}, c),$$

where $a$, $b$, $\bar{x}_{AB}$, $c$ respectively stand for the sample proportions of the four blood groups A, B, AB, O. Rather than writing down the likelihood function immediately we shall consider the problem as one of an incomplete observation from an exponential family. The probability of $x$ is

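$$
\begin{aligned}
p(x) &= \binom{n}{x}\, p^{2x_{AA}} (2pr)^{x_{AO}}\, q^{2x_{BB}} (2qr)^{x_{BO}}\, (2pq)^{x_{AB}}\, r^{2x_{OO}} \\
&= \binom{n}{x}\, 2^{x_{AO}+x_{BO}+x_{AB}} \exp\{(2x_{AA}+x_{AO}+x_{AB}) \ln p + (2x_{BB}+x_{BO}+x_{AB}) \ln q + (2x_{OO}+x_{AO}+x_{BO}) \ln r\} \\
&= \binom{n}{x}\, 2^{x_{AO}+x_{BO}+x_{AB}} (1-p-q)^{2n} \exp\Big\{(2x_{AA}+x_{AO}+x_{AB}) \ln\frac{p}{1-p-q} + (2x_{BB}+x_{BO}+x_{AB}) \ln\frac{q}{1-p-q}\Big\} \\
&= b(x)\,(1 + e^{\theta_1} + e^{\theta_2})^{-2n}\, e^{T_1(x)\theta_1 + T_2(x)\theta_2},
\end{aligned}
$$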

where

$$\theta_1 = \ln\frac{p}{1-p-q} \quad\text{and}\quad \theta_2 = \ln\frac{q}{1-p-q},$$

$$T_1(x) = 2x_{AA} + x_{AO} + x_{AB}, \qquad T_2(x) = 2x_{BB} + x_{BO} + x_{AB}.$$

The range of $\theta = (\theta_1, \theta_2)$ for $(p, q)$ varying in $\Delta$, where

$$\Delta = \{(p, q) \mid p, q > 0,\; p + q < 1\},$$

is the real plane. If $(T_1(x), T_2(x))$ had been observable, the maximum-likelihood estimation of $\theta_1$ and $\theta_2$, and hence that of $p$, $q$, and $r$, would have been straightforward. But $(T_1(x), T_2(x))$ is not a function of the observable $y$ and we have a situation with incomplete data from an exponential family. It is true, in general (see Sundberg, 1974), in such situations that the likelihood equation considered as an equation in $\theta$ is

$$E_\theta T = E_\theta(T \mid y), \qquad (2)$$

where $E_\theta T$ is the mean value of $T$ and $E_\theta(T \mid y)$ is the conditional mean value of $T$ given the observed $y$. In terms of $p$ and $q$ the mean of $T$ is $2n(p, q)$, and noting that the conditional distribution of $x_{AA}$ given $x_{AA} + x_{AO}$ is binomial $b(x_{AA} + x_{AO};\, p^2/(p^2 + 2pr))$, and similarly for $x_{BB}$ given $x_{BB} + x_{BO}$, the conditional mean of $T$ given $y$ is found in terms of $p$, $q$, and $r$ to be

$$E(T \mid y) = \Big(x_{AA} + x_{AO} + x_{AB} + (x_{AA} + x_{AO})\frac{p^2}{p^2 + 2pr},\;\; x_{BB} + x_{BO} + x_{AB} + (x_{BB} + x_{BO})\frac{q^2}{q^2 + 2qr}\Big). \qquad (3)$$


Hence the likelihood equation (2) as an equation in $p$, $q$ is

$$2np = x_{AA} + x_{AO} + x_{AB} + (x_{AA} + x_{AO})\frac{p^2}{p^2 + 2pr},$$

$$2nq = x_{BB} + x_{BO} + x_{AB} + (x_{BB} + x_{BO})\frac{q^2}{q^2 + 2qr},$$

which is the fundamental equation $p = p'$, $q = q'$ of the gene-counting method developed by Ceppellini, Siniscalco & Smith (1955) and in subsequent papers by Smith (1957) and Smith (1967). Likelihood equations like (2) arise in a variety of contexts and lead to iterative procedures for finding the maximum-likelihood estimates. General properties of such iterative procedures are given by Dempster, Laird & Rubin (1977). The conditional mean of $T$ given $y$ is found in terms of $\theta$ from (3) and (1), since $p^2/(p^2 + 2pr) = p/(p + 2r) = e^{\theta_1}/(2 + e^{\theta_1})$ and similarly for $q$:

$$E_\theta(T \mid y) = \Big(x_{AA} + x_{AO} + x_{AB} + (x_{AA} + x_{AO})\frac{e^{\theta_1}}{2 + e^{\theta_1}},\;\; x_{BB} + x_{BO} + x_{AB} + (x_{BB} + x_{BO})\frac{e^{\theta_2}}{2 + e^{\theta_2}}\Big).$$

Using the fact that $(\theta_1, \theta_2)$ may take any value in the real plane, we find the range of $E_\theta(T \mid y)$ to be the rectangle

$$A = \{(t_1, t_2) \mid x_{AA} + x_{AO} + x_{AB} < t_1 < 2(x_{AA} + x_{AO}) + x_{AB},\;\; x_{BB} + x_{BO} + x_{AB} < t_2 < 2(x_{BB} + x_{BO}) + x_{AB}\}.$$

The idea in the proof is to exploit the fact that the right-hand side of the likelihood equation (2) is confined to $A$ to find a region which is known to contain the solutions to the likelihood equation. Then the second derivative of the log likelihood function with respect to $(\theta_1, \theta_2)$ is considered and a region is found where it is negative definite. Incidentally, the second derivative of the log likelihood function with respect to $(p, q)$ is less tractable than the derivative with respect to $\theta$, and this is the reason for introducing the parameter $\theta$ in this note. Finally some sufficient conditions are given for the solution(s) of the likelihood equation to fall within the region where the second derivative of the log likelihood function is negative definite. When this is the case the likelihood equation has exactly one solution: the maximum-likelihood estimate.
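The gene-counting iteration $p \to p'$, $q \to q'$ implied by the likelihood equation can be sketched in a few lines. This is only an illustrative sketch, not the implementation of any of the cited papers: the function name `gene_counting`, the starting values, the tolerance, and the phenotype counts (47, 22, 18, 13) are assumptions made for the example.

```python
def gene_counting(n_a, n_b, n_ab, n_o, p=0.3, q=0.3, tol=1e-12, max_iter=10_000):
    """Iterate p -> p', q -> q' until successive iterates differ by less than tol.

    n_a, n_b, n_ab, n_o are observed counts of the blood groups A, B, AB, O
    (hypothetical inputs for this sketch).
    """
    n = n_a + n_b + n_ab + n_o
    for _ in range(max_iter):
        r = 1.0 - p - q
        # Conditional means of the unobserved homozygote counts x_AA and x_BB
        # given the observed group counts (binomial means).
        e_aa = n_a * p * p / (p * p + 2.0 * p * r)
        e_bb = n_b * q * q / (q * q + 2.0 * q * r)
        # Right-hand side of the likelihood equation divided by 2n.
        p_new = (n_a + n_ab + e_aa) / (2.0 * n)
        q_new = (n_b + n_ab + e_bb) / (2.0 * n)
        converged = abs(p_new - p) + abs(q_new - q) < tol
        p, q = p_new, q_new
        if converged:
            break
    return p, q, 1.0 - p - q


# Hypothetical sample of n = 100 blood types.
p_hat, q_hat, r_hat = gene_counting(47, 22, 18, 13)
```

A fixed point of this map satisfies both displayed equations for $2np$ and $2nq$; whether the iteration reaches the unique maximum is exactly the question the rest of the note addresses.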


A REGION CONTAINING THE SOLUTION(S) TO THE LIKELIHOOD EQUATION

Although we shall consider the second derivative of the log likelihood function with respect to $\theta$, it will be convenient to express the region where it is negative definite in terms of $(p, q)$. Thus we shall be interested in a region in $\Delta$, the set of admissible values of $(p, q)$, which is known to contain the solutions to the likelihood equation. Now the likelihood equation may be written

$$2n(p, q) = E_{(p,q)}(T \mid y),$$

where $E_{(p,q)}(T \mid y)$ is in the rectangle $A$, so the solution to the likelihood equation is in $(1/2n)A$, i.e. in the rectangle

$$B = \{(t_1, t_2) \mid \tfrac12 a + \tfrac12 \bar{x}_{AB} < t_1 < a + \tfrac12 \bar{x}_{AB},\;\; \tfrac12 b + \tfrac12 \bar{x}_{AB} < t_2 < b + \tfrac12 \bar{x}_{AB}\}.$$
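As a numerical illustration of the confinement to the rectangle $B$, the sketch below evaluates the conditional mean divided by $2n$ on a grid of admissible $(p, q)$ and checks that every value falls strictly inside $B$. The function names and the grid resolution are assumptions of this sketch; the proportions are those quoted for case (a) of Fig. 1.

```python
def rectangle_b(a, b, x_ab):
    """Bounds ((t1_low, t1_high), (t2_low, t2_high)) of the rectangle B."""
    return ((0.5 * a + 0.5 * x_ab, a + 0.5 * x_ab),
            (0.5 * b + 0.5 * x_ab, b + 0.5 * x_ab))


def scaled_conditional_mean(p, q, a, b, x_ab):
    """E(T | y) / (2n) at (p, q), written in terms of the sample proportions."""
    r = 1.0 - p - q
    # p^2 / (p^2 + 2pr) simplifies to p / (p + 2r), and likewise for q.
    t1 = 0.5 * (a + x_ab + a * p / (p + 2.0 * r))
    t2 = 0.5 * (b + x_ab + b * q / (q + 2.0 * r))
    return t1, t2


a, b, x_ab = 0.47, 0.22, 0.18
(t1_lo, t1_hi), (t2_lo, t2_hi) = rectangle_b(a, b, x_ab)

# Grid of admissible points: p, q > 0 and p + q < 1.
inside = all(
    t1_lo < t1 < t1_hi and t2_lo < t2 < t2_hi
    for i in range(1, 50)
    for j in range(1, 50 - i)
    for (t1, t2) in [scaled_conditional_mean(i / 50, j / 50, a, b, x_ab)]
)
```

Here `inside` summarizes the grid check; a finite grid illustrates the inclusion but is of course not a proof of it.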

THE SECOND DERIVATIVE OF THE LOG LIKELIHOOD FUNCTION

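Considering the data as incomplete observations from an exponential family, the second derivative of the log likelihood function is

$$(\partial/\partial\theta)^2 \ln L(\theta_1, \theta_2; y) = -V_\theta T + V_\theta(T \mid y),$$

where $V_\theta T$ is the variance of $T$ and $V_\theta(T \mid y)$ is the conditional variance of $T$ given $y$ (see Sundberg, 1974). It is of course possible to obtain the second derivative of the log likelihood function directly from the likelihood function without appeal to the theory of incomplete observations from an exponential family. But in this case, where we want the second derivative of the log likelihood function with respect to $\theta$ as a function of $p$, $q$, it is most easily found using the formula above. Noting that $T = (T_1, T_2)$ is multinomially distributed, $m(2n; p, q)$, the variance of $T$ is found to be

$$V_\theta T = 2n \begin{pmatrix} p(1-p) & -pq \\ -pq & q(1-q) \end{pmatrix},$$

and using once again the fact that the conditional distribution of $x_{AA}$ given $x_{AA} + x_{AO}$ is binomial $b(x_{AA} + x_{AO};\, p^2/(p^2 + 2pr))$, and similarly for $x_{BB}$ given $x_{BB} + x_{BO}$, the conditional variance of $T$ given $y$ is

$$V_\theta(T \mid y) = \begin{pmatrix} (x_{AA} + x_{AO})\dfrac{2pr}{(p + 2r)^2} & 0 \\ 0 & (x_{BB} + x_{BO})\dfrac{2qr}{(q + 2r)^2} \end{pmatrix}.$$

We may as well consider

$$\Sigma = \frac{1}{2n}(\partial/\partial\theta)^2 \ln L,$$

since our only interest is to see when $(\partial/\partial\theta)^2 \ln L$ is negative definite. $\Sigma$ may be written

$$\Sigma = \begin{pmatrix} -\alpha + (\beta - \delta) & \alpha \\ \alpha & -\alpha + (\gamma - \epsilon) \end{pmatrix},$$

where $\alpha$, $\beta$, $\gamma$, $\delta$, and $\epsilon$ are functions of $p$, $q$, $a$, and $b$:

$$\alpha = pq, \quad \beta = \frac{a\,pr}{(p + 2r)^2}, \quad \gamma = \frac{b\,qr}{(q + 2r)^2}, \quad \delta = pr, \quad \epsilon = qr.$$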
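To make the matrix $\Sigma$ concrete, the following sketch evaluates its entries and eigenvalues at a given $(p, q)$. It assumes the reading $\Sigma = (1/2n)(\partial/\partial\theta)^2 \ln L$ with $\beta = a\,pr/(p+2r)^2$ and $\gamma = b\,qr/(q+2r)^2$, which follows from the variance formula but is a reconstruction; the function names are hypothetical.

```python
import math


def sigma(p, q, a, b):
    """Entries (s11, s12, s22) of the symmetric 2x2 matrix Sigma at (p, q)."""
    r = 1.0 - p - q
    alpha, delta, eps = p * q, p * r, q * r
    # Assumed forms of beta and gamma (conditional variance terms over 2n).
    beta = a * p * r / (p + 2.0 * r) ** 2
    gamma = b * q * r / (q + 2.0 * r) ** 2
    return (-alpha + beta - delta, alpha, -alpha + gamma - eps)


def eigenvalues(s11, s12, s22):
    """Eigenvalues of [[s11, s12], [s12, s22]], smaller one first."""
    half_trace = 0.5 * (s11 + s22)
    root = 0.5 * math.sqrt(4.0 * s12 ** 2 + (s11 - s22) ** 2)
    return half_trace - root, half_trace + root


def is_negative_definite(p, q, a, b):
    """True when the larger eigenvalue of Sigma is negative."""
    return eigenvalues(*sigma(p, q, a, b))[1] < 0.0


# Case (a) proportions of Fig. 1, evaluated at an arbitrary interior point.
lam1, lam2 = eigenvalues(*sigma(0.3, 0.15, 0.47, 0.22))
```

Because $\Sigma$ is symmetric and 2 by 2, the closed-form eigenvalues suffice and no linear-algebra library is needed.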

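The eigenvalues of $\Sigma$ are found in terms of these functions to be

$$\tfrac12\Big\{-2\alpha + (\beta - \delta) + (\gamma - \epsilon) \pm \sqrt{[4\alpha^2 + ((\beta - \delta) - (\gamma - \epsilon))^2]}\Big\}.$$

One has that $\lambda_1$, the smaller of the two eigenvalues, is less than

$$\tfrac12\big(-2\alpha + (\beta - \delta) + (\gamma - \epsilon) - |(\beta - \delta) - (\gamma - \epsilon)|\big).$$

Using the relation that for any real numbers $x$, $y$

$$2\min(x, y) = x + y - |x - y|,$$

one finds that $\lambda_1$ is less than $-\alpha + \min(\beta - \delta, \gamma - \epsilon)$. It can be shown that this function is negative for $(p, q)$ in $\Delta$, the set of admissible values, but the argument will not be given here. It follows that the log likelihood function does not have any local minima. However, the main problem is to find, in terms of $p$ and $q$, the region where the greater eigenvalue

$$\lambda_2 = \tfrac12\Big\{-2\alpha + (\beta - \delta) + (\gamma - \epsilon) + \sqrt{[4\alpha^2 + ((\beta - \delta) - (\gamma - \epsilon))^2]}\Big\}$$

is negative. This has not been possible, but an approximation may be found noting that

$$\lambda_2 \le \tfrac12\big(-2\alpha + (\beta - \delta) + (\gamma - \epsilon) + 2\alpha + |(\beta - \delta) - (\gamma - \epsilon)|\big) = \max(\beta - \delta, \gamma - \epsilon), \qquad (4)$$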


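where we use the relations that for $x$, $y$ positive $\sqrt{(x^2 + y^2)} < x + y$ and for any real numbers $x$, $y$, $2\max(x, y) = x + y + |x - y|$. It is worth noting that the inequality in (4) is strict except when $(\beta - \delta) - (\gamma - \epsilon) = 0$, in which case $\lambda_2 = \beta - \delta = \gamma - \epsilon$. Defining

$$C_1 = \{(p, q) \mid p, q > 0 \text{ and } \beta - \delta \le 0\} = \{(p, q) \mid p, q > 0 \text{ and } p \le 2 - \sqrt{a} - 2q\}$$

and

$$C_2 = \{(p, q) \mid p, q > 0 \text{ and } \gamma - \epsilon \le 0\} = \{(p, q) \mid p, q > 0 \text{ and } q \le 2 - \sqrt{b} - 2p\},$$

it follows from (4) that $\lambda_2(p, q)$ is negative for

$$(p, q) \in C = C_1 \cap C_2.$$

$C$ is always non-empty. The lines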

defining respectively the sloping boundary of $C_1$ and $C_2$ intersect at $\tfrac13(2 + \sqrt{a} - 2\sqrt{b},\; 2 - 2\sqrt{a} + \sqrt{b})$, and this point belongs to $\Delta$, the set of admissible values of $(p, q)$, if $\sqrt{a} + \sqrt{b} > 1$. Thus when $\sqrt{a} + \sqrt{b} > 1$, $\lambda_2$ will be positive for some values of $(p, q)$ in $\Delta$, showing that the log likelihood function is not concave (as a function of $\theta_1$ and $\theta_2$). Indeed, the expression for $\lambda_2$ shows that $\lambda_2 \ge \tfrac12(\beta - \delta + \gamma - \epsilon)$, and when $\sqrt{a} + \sqrt{b} > 1$ the right-hand side of this inequality is positive for

$$(p, q) \in D = \Delta \cap C_1^c \cap C_2^c,$$

where $C_1^c$ denotes the complement of $C_1$ and $C_2^c$ denotes the complement of $C_2$.

SUFFICIENT CONDITIONS FOR C TO INCLUDE B

It is obvious from the shapes of $B$ and $C$ that $B$ is contained in $C$ if and only if $(a + \tfrac12\bar{x}_{AB},\; b + \tfrac12\bar{x}_{AB})$ is in $C$. Now this is the case if and only if

$$2a + 2\sqrt{a} + 4b + 3\bar{x}_{AB} - 4 \le 0, \qquad (5)$$

$$4a + 2b + 2\sqrt{b} + 3\bar{x}_{AB} - 4 \le 0, \qquad (6)$$

where (5) is a necessary and sufficient condition for $(a + \tfrac12\bar{x}_{AB},\; b + \tfrac12\bar{x}_{AB})$ to lie below or on the line

$$q = 1 - \tfrac12\sqrt{a} - \tfrac12 p$$

and (6) is a necessary and sufficient condition for $(a + \tfrac12\bar{x}_{AB},\; b + \tfrac12\bar{x}_{AB})$ to lie below or on the line $q = 2 - \sqrt{b} - 2p$. Thus (5) and (6) are sufficient conditions for the likelihood equation to have a unique solution. In order to obtain a more readily interpretable sufficient condition we note, using the relation $a + b + \bar{x}_{AB} + c = 1$, that (5) is equivalent to

$$-a + 2\sqrt{a} + b - 3c - 1 \le 0,$$

which in turn is equivalent to

$$-(\sqrt{a} - 1)^2 + (\sqrt{b})^2 \le 3c. \qquad (7)$$

For fixed $c$ the region in terms of $\sqrt{a}$ and $\sqrt{b}$ where (7) fails is of a nice geometrical shape. The set (in a plane with coordinates $x = \sqrt{a}$, $y = \sqrt{b}$) where (7) fails is

$$D_c = \{(x, y) \mid -(x - 1)^2 + y^2 > 3c\}.$$

Fig. 1. Two special cases of the relationship between $B$ and $C$ (axes $p$ and $q$). (a) $(a, b, \bar{x}_{AB}, c) = (0.47, 0.22, 0.18, 0.13)$, (b) $(a, b, \bar{x}_{AB}, c) = (0.38, 0.40, 0.18, 0.04)$. The data in case (b) are not very likely to come from a sample of ABO blood types because of the low frequency of the blood group O; they are chosen to illustrate the theoretical possibility that $B$ may intersect a region where the log likelihood function is not concave.

Fig. 2. The relative positions of $D_c$ and the set $x^2 + y^2 \le 1 - c$ shown for the values of $c$ used in Fig. 1. (a) $c = 0.13$, (b) $c = 0.04$. The straight lines are the asymptotes of the hyperbola $-(x - 1)^2 + y^2 = 3c$.

$D_c$ is the upper part and interior of a hyperbola with equation

$$-(x - 1)^2 + y^2 = 3c$$

and hence with asymptotes

$$y = x - 1 \quad\text{and}\quad y = -x + 1.$$

This gives a clue to a simpler condition than (7). Recalling that $a$, $b$, and $c$ are the sample proportions of the blood groups A, B, and O, they must satisfy $a + b + c \le 1$, which may be expressed in the $x$, $y$ plane as

$$x^2 + y^2 \le 1 - c. \qquad (8)$$

When $c$ increases this region becomes smaller and $D_c$ moves upwards, so for $c$ greater than some value $c_0$ the two regions will no longer intersect, and for such values of $c$ (7) will hold for all values of $x$ and $y$ that satisfy (8). To find $c_0$ one notes that $(x, y)$ belongs to $D_c$ if and only if

$$(x - 1)^2 - y^2 < -3c,$$

which may be added to (8) to give the inequality

$$(x - 1)^2 + x^2 + 4c - 1 = 2x^2 - 2x + 4c < 0. \qquad (9)$$

The quadratic $2x^2 - 2x + 4c$ has discriminant $4 - 32c$, so the left-hand side of (9) is non-negative for all $x$ if $c \ge \tfrac18$. Using a parallel argument the condition

$$c \ge \tfrac18 \qquad (10)$$

is seen to imply (6), so (10) is a sufficient condition for the likelihood equation to have a unique solution. As noted above, the sloping boundaries of $C_1$ and $C_2$ intersect outside $\Delta$, the set of admissible values of $(p, q)$, if

$$\sqrt{a} + \sqrt{b} < 1, \qquad (11)$$

so it may be of some interest to point out that (11) is also a sufficient condition for the likelihood equation to have a unique solution.
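The sufficient conditions (5), (6), (10), and (11) are simple enough to check mechanically. The sketch below does so for the two sets of proportions quoted in the caption of Fig. 1; the function names are hypothetical, and (10) is taken as $c \ge 1/8$, the threshold at which the quadratic in (9) can no longer be negative.

```python
import math


def sufficient_conditions(a, b, x_ab, c):
    """Evaluate conditions (5), (6), (10), (11) for sample proportions a, b, x_ab, c."""
    sa, sb = math.sqrt(a), math.sqrt(b)
    return {
        "(5)": 2 * a + 2 * sa + 4 * b + 3 * x_ab - 4 <= 0,
        "(6)": 4 * a + 2 * b + 2 * sb + 3 * x_ab - 4 <= 0,
        "(10)": c >= 1 / 8,
        "(11)": sa + sb < 1,
    }


def unique_solution_guaranteed(a, b, x_ab, c):
    """True if (5) and (6) hold together, or (10) holds, or (11) holds."""
    checks = sufficient_conditions(a, b, x_ab, c)
    return (checks["(5)"] and checks["(6)"]) or checks["(10)"] or checks["(11)"]


case_a = sufficient_conditions(0.47, 0.22, 0.18, 0.13)
case_b = sufficient_conditions(0.38, 0.40, 0.18, 0.04)
```

A `False` result from `unique_solution_guaranteed` does not mean that several solutions exist; the conditions are sufficient, not necessary.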


SUMMARY

Some sufficient conditions on the data for the likelihood equation of the ABO blood-group system to have a unique solution, the maximum-likelihood estimate, are given. The simplest of these conditions is that the frequency of the blood group O in the sample shall exceed $\tfrac18$. This condition will hold for most samples.

REFERENCES

CEPPELLINI, R., SINISCALCO, M. & SMITH, C. A. B. (1955). The estimation of gene frequencies in a random mating population. Ann. Hum. Genet., Lond. 20, 97-115.

DEMPSTER, A. P., LAIRD, N. M. & RUBIN, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B 39, 1-38.

SMITH, C. A. B. (1957). Counting method in genetical statistics. Ann. Hum. Genet., Lond. 21, 254-76.

SMITH, C. A. B. (1967). Notes on gene frequency estimation with multiple alleles. Ann. Hum. Genet., Lond. 31, 99-107.

SUNDBERG, R. (1974). Maximum likelihood theory for incomplete data from an exponential family. Scand. J. Statist. 1, 49-58.