Am JHum Genet 31:161-175, 1979

Maximum Likelihood Estimation by Counting Methods Under Polygenic and Mixed Models in Human Pedigrees JURG OTT

SUMMARY

For pedigree data, the maximum likelihood estimates of the parameters in polygenic and mixed models are derived analytically although not in closed form but in terms of "counting equations" allowing an iterative solution. Likelihood computations, tests of significance, and tests of goodness of fit are presented. Accelerating the (linear) rate of convergence by a very simple method is demonstrated.

1. INTRODUCTION The classical biometrical approach to the analysis of quantitative traits does not allow for the presence of a predominant single gene. The segregation of single genes is most easily studied in dichotomous traits. A combination of the two approaches (i.e., the demonstration of a major gene in the presence of quantitative variation) has been difficult, especially for the analysis of pedigree data. In an early attempt, Mittmann [1] fitted the offspring distribution of various crosses to expected curves under two gene hypotheses. The first gene was considered the major gene, while the second gene served as an approximation to the joint effects of polygenes. Its influence on the phenotype was allowed to vary from zero to 100%. Mittmann constructed a test statistic, computed its standard error, and applied his techniques to published data on crosses of the moth, Ephestia kiihniella. In the sixties, an enzyme distribution with a seemingly normal shape was recognized as a mixture of distributions with means corresponding to the genotypes at a single

locus [2]. Penrose [3] compared the effects of additive polygenes with those of a few single loci on parent-child and sib-sib bivariate distributions. Based on the differences

Received March 8, 1978; revised October 17, 1978. This research was supported by grant GM 15253 from the National Institutes of Health. 'Center for Inherited Diseases, RG-20, University of Washington, Seattle, Washington, 98195. © 1979 by the American Society of Human Genetics. 0002-9297/79/3102-0002$01.30

161

OTT 162 between the two members of each of the two types of pairs, he developed a statistic and its standard error to estimate the number of loci involved in the expression of a quantitative trait and applied his results to data on stature, cephalic index, etc. Birnbaum [41 proposed graphical methods for comparing observed offspring distributions and those expected under Mendelian inheritance which he then applied to data on skeletal variation in the mouse. Weiss [5] reviewed the two lines of thought (i.e., the biometrical and the Mendelian approach) and proposed an iterative algorithm for the separation of genotype distributions in relatives. Statistical properties of this procedure are not known. A likelihood approach to the estimation of single effects vs. polygenic effects in pedigree data was pointed out by Elston and Stewart [6]. This paper derives the maximum likelihood estimates of the parameters in a model including single gene and polygenic effects for general pedigrees. This is done on the basis of the conditional multivariate distribution of pedigree data given the genotypes at the single locus. Because of the logical connection to the polygenic case, this hypothesis is considered first, and the results extend easily to include major gene effects. The estimates are given analytically, although not usually in closed form but in equations permitting an iterative solution (counting methods). Computational aspects of estimation and goodness of fit tests are also considered. The analysis of pedigree data shows some formal connections to the analysis of observations in an unbalanced design [7]. 2. COUNTING METHODS

Counting methods were first used in human genetics by Ceppellini et al. [8] for the iterative maximum likelihood (ML) estimation of gene frequencies in ABO-like systems. Such methods were introduced into general pedigree analysis by Ott [9, 10]. Dempster et al. [11] showed that counting methods (and several other iterative procedures in common use) are all special cases of a process termed the EM algorithm which maximizes the likelihood of incomplete data [i.e., data in which the parameters to be estimated specify the distribution of underlying variables (e.g., genotypes) that can only be observed indirectly through the phenotypes]. Setting the first derivative of the likelihood equal to zero leads to analytical representations of the ML estimates in the form of iteration equations (counting equations, [10]). These have to be solved with initial parameter values yielding values with a likelihood higher than the original estimates. The solutions are then used to obtain a set of parameter values still closer to the correct ML estimates. The iteration process stops when a certain criterion is satisfied (e.g., when the increase in likelihood between two successive iterations is very small). When the phenotypes allow a direct observation of the underlying variables, then the counting equations become identical with the usual ML estimates (i.e., the ML is attained in a single "iteration"). The counting equations can generally be represented in an alternate form [10] pi+, = pi + vi Zi' where pi = value of a parameterp at ith iteration, Z1' = first derivative of the log likelihood, vi = variance of the ML estimate of p given direct observation of the underlying variables. In Newton-Raphson iterations, vi is set equal to -lIZi". Practical application of the counting equations is easiest in the alternate forms with numerical evaluation of Zi' [12]. While counting iterations tend to have excellent properties for poor initial esti-

COUNTING METHODS UNDER A MIXED MODEL

163

mates, the convergence is linear in the vicinity of the ML [11]. The formulas derived below refer to data from a single pedigree. Applications to more than one pedigree are considered in the discussion. 3. POLYGENIC INHERITANCE

Consider a pedigree with n individuals and a measurement xi, i = 1. each. The mathematical model forxi under polygenic inheritance is

n on

(1) xi = / + ai + ci + ei where g is a general mean, ai is an additive polygenic effect, and ei an environmental effect. The common sibship effect ci is shared by all individuals in the same sibship. The effects ai, ci, and ei are all random, g being the only fixed effect. Assume mutual independence among the random effects and normal distribution for each with means zero and variance components 0ra2, o-2 and 0>2. Then, xi follows the normal distribution N(A, o-2) where o-2 = (Ta2 + OJ2 + re2. The observations on the pedigree members can be represented by the data vector x = . (xx, XJ so that the model (1) can be written in vector notation as x = tkI + a + c + e (2)

where 1 is a vector of ones. The random vectors a, c and e follow the respective normal distributions N(O, 0-q2A), N(O, oc2C), and N(O, 0>,21), 1 being the n x n identity matrix so that the covariance matrix of x is given by Q= oa2A + O_'2C + 0>21. (3) The matrix A contains the coefficients that, when multiplied by o-r2, determine the additive polygenic portion of the covariance between the relatives in the pedigree. The matrix C contains elements cii = 1 along the diagonal, cij = 1 if the ith and jth individuals live in the same sibship, and cij = 0 otherwise. The elements of the A matrix can easily be obtained for general pedigrees by published algorithms [13-15]. Following the practice of Morton et al. [16], model (2) does not contain any polygenic dominance effects but instead allows for common sibship effects. Consequently, consanguinity in a pedigree does not pose any particular problems. For nonconsanguineous pedigrees without double cousins where dominance effects occur in sibs only, common sibship effects and dominance effects are completely confounded as follows. Imagine a model x =, + a + d + e (4) which differs from (2) in that c is replaced by a vector d of dominance effects which is distributed as N(O, (rd2D). The D matrix contains elements dii = 1 along the diagonal, dij = 1/4 if the ith and jth individuals are sibs (i * j), and d1, = 0 otherwise. Clearly, the linear relationship 4D - 31 = C holds so that fl = 0>,'1A + o-2(4D -3I) + eI = r~a2A + 401c2D + (0r2 - 30C2)1 [i.e., the variance components (-C2 and cre2 of model (2) have to be interpreted under model (4) as a dominance variance of 40rC2 and an environmental variance of -e02 - 30rC2 respectively when c and d are completely

confounded].

164

OTT For model (2), with o-2 = 0, Elston and Stewart [6] have described an algorithm to recursively c6mpute the likelihood of a simple pedigree. Estimation of /1, Ora2 and cre2 was then done by searching the likelihood surface. Using path coefficient analysis for various pairs of relatives, Rao et al. [17, 18] have estimated several effects in addition to the ones considered in model (2) including genotype-environment correlations. Such effects can be expressed through variance components associated with appropriate coefficient matrices. Lange et al. [13] have estimated the variance components in model (4) by the scoring method which finds the ML estimates iteratively based on the first and second derivatives of the likelihood. Maximizing the log likelihood of error contrasts, Thompson [19] estimated the heritability (i.e., oa2/a2) for specific family types.

3.1 Counting Equations Under model (2), the observations x are distributed as N(pLI, fl) where fl is given by (3). The likelihood L is thus proportional to the multivariate normal density. That is, L

=

Ili

exp

[-42

(x

-al)tl-1( (-Il)].

(5)

Throughout this paper, let L' denote the partial derivative of L with respect to the parameter under study and xt the transpose of the row vector x. 3.1.1 Means. For the mean Au, using Z = logeL, L' = LZ' and some algebraic manipulation, one obtains L' = L(xt C 1 -A.lt Cl-'1), which is the multivariate analog to the last formula on p. 449 in [10]. Setting L' = 0 leads to I l-'1 XI (6) xtCll which can be solved explicitly if values of the variance components are given. An alternate form of (6) is obtained by setting xt l-1 1 = putI '1 + L'IL so that, withZ' =

= L'IL,

.=.L+ Z'ilt l1 1,

(7)

which is a short notation for (8) k+1= /k + Z'(pk)l-t C1_ 1 where Ak denotes the value of 1L obtained in the kth iteration. For convenience, all iteration equations below are given in the short notation. If all n individuals in the pedigree are independent (i.e., if ora2 = aC2 = 0) then Cl = or2I, It C-' I = ni'9 and equation (6) becomes A' = x where x is the arithmetic mean of the observations. For a sibship, all members have the same correlation p = (.5 ora2 + oc2)/o2 so that Cl = cr2R, R being the correlation matrix with diagonal elements rij = 1 and off-diagonal elements ru = p. Then,j 'l-' 1 becomes n'/a'where n' = n(1 +np - p). For p 2.0, n' ' n so that n/o- is an upper limit to 1I Cl' 1. For general pedigrees, one can define n' = or1It ClI 1 so that (7) becomes (9) A = A + Z'ar2/n'. The value of n' depends on the values of the variance components and is thus not

COUNTING METHODS UNDER A MIXED MODEL 165 constant. One can use (9) in an approximate way by substituting n for n' which makes, at least for sibships, the iteration steps smaller. The value taken for n' in (9) does not influence the resulting estimate of A, since the iteration process stops in the vicinity of Ai where Z' = 0, which is the case no matter what value n' has. While the computation of (6) requires the solution of fly = 1, the approximate evaluation of (9) requires the calculation of two likelihoods if Z' is approximated by Z'(pu)- [Z(,u+8)-Z(pu)]I8, (10)

8 being a small increment of Au. A procedure for the recursive calculation of L is described in the following section. In some situations, one may have to assume different means for different individuals in a pedigree. In particular, we may assume, as in [13], that there is a sex difference in the level of x. When the n individuals are ordered so that the first n 1 are females and the remaining n2 are males with respective means Al and 2, then A1 in model (2) can be replaced by g = (,Al !U2 1, l). Accordingly, let fl-1 be partitioned into submatrices fij ij = 1,2. Also, let x be partitioned into (xl, x2)'. Because xl and x2 are generally correlated, the ML estimates of Ai cannot be obtained from xi alone. Taking the partial derivative of (5) with respect to pA, one obtains L' = L(itfl'x, + I I''2X2 - -A P t l (11) A2 It f12l). Setting L' equal to zero leads to = [1' fillx1 + 1' f12 (x2 - 20,l)]/l' fQ"" (12) with the alternate form given by Al = Al + Z'/l'I 'l 1 which is very similar to (7). In analogy to the discussion of (7) above, one can define n = o 1 1l I 1 (13) so that the counting equation for ,l is given by

(14) Al = Al +Z' r&/n1' which can be approximated by n1' = n 1 and (10). The equations for P2 are obtained in the same manner. Iterations over A, and A2 jointly are started with initial values for the two parameters. With these and a given value of o2, the corresponding two counting equations are solved, the results used again in the counting equations, and so on. Linear regression in pedigree data under a model with only single gene and environmental effects has been considered earlier [10]. For polygenic inheritance, taking age as an example, this can be treated as follows. Consider model (2) and let ,u depend linearly on the age hi of the ith individual (i.e., each individual has its own mean, Ai = a + 8hi). With the vector of ages h = (h1, . . . , h,) the mean vector is given by a = aI + 1bh. Since (x-,a) is now equal to (x-,13h- al), a can be estimated in analogy to A in (5) as a = (x-fh)t 1 j/jt f-I 1 with an approximate alternate form of a = a + Z' a/n. Taking the partial derivative of (5) with respect to f3, one obtains L'IL = (x-al)' Q-lh -h' f-1h. Setting L' = 0 yieldsf = (x-al)t' -lh/ht r-'h with an alternate form of

OTT 3 + Z'/h15-1h.

166

3= (15) To approximate (15), consider fl = o2R with correlation p - 0 for which ht fl-'h .Jz2/-2. For general pedigrees, (15) can thus be approximated by /3 = /8 + Z'r2/jz~h2. If u depends on age and sex, then one has to estimate three regression parameters (i.e., the levels a1 and a2 for the two sexes, and the slope /8). 3.1.2 Variance components. For the estimation of (Ta , first consider the theoretical situation in which the additive effects ai in (1) could be measured on each individual. The variables a are distributed as N(O, Oa2A) with likelihoodf(a) = (21T)-,/2 (Ca2)-n'2 IA 1-1/2 exp(-l/2at A-la/0a2) and derivative

f'(a) = ½/2f(a) (atA-I alcra4 so that the ML estimate of cra2 is easily obtained as (2 = a'A1a/n

-

n

/0Ja2),

(16) (17)

Since the a are not known, write the likelihood as L =f(x) = f(xla)f(a) da (18) where da = da1 . . . dan and f denotes an n-fold integral. The densityf(xla) is N(g +a, (Cc2 C+cre2 1) which is free of 0ra2. Thus, using (16), L'= + ff(xla)fi(a) (a'A'l a/a4-nlra2) da.

(19)

SettingL' = 0 and solving for ca2 withf(xla)f(a) = Lf(alx) leads to

0ra2= fatA -af(alx) daln =E(at A-' alx) In > 0

(20)

which nicely shows the analogy to (17). Equations (19) and (20) can be combined to obtain the alternate form

aa2 = Cra2 (I + 2Z'O-a2/n).

(21) A direct solution of (20) is more complicated and hardly suitable for practical application. Withf(alx) being N[cra2 A fQ-l (x -,u), cra2 A (I - ra2 Q-I A)], one obtains E(at A -1 aix) = trace A -1 E(aatlx) so that, after some algebra, the result is = ra2(1 -cra2 siln)

(22) where si is the ith diagonal element of fQ- A - Q-l A f-l (x-,ul) (x-,ul)' ,which is quite cumbersome to compute. The counting equations for o-e2 and cre2 are obtained analogously and lead to alternate expressions of the form (21). These show that zero is not a suitable starting value for the iterations over the variance components. Ua

3.2 Recursive Computation of the Likelihood As the alternate forms of the iteration equations are more convenient than their analytical counterparts, only the computation of the likelihood L is considered here. Partial derivatives can be approximated as in (10). For model (2), with cr,2 = 0, Elston

167 COUNTING METHODS UNDER A MIXED MODEL and Stewart [6] described an elegant recursive algorithm for the computation of L in simple pedigrees. They used the fact that the observations of the offspring are mutually independent given the parental additive polygenic values. For model (2) and any type of pedigree, the likelihood can be computed using matrix methods, as in [13], by solving the system of equations fly = (x-A). A recursive procedure can be obtained by regression methods as follows. With given parameter values, the n observations follow a multivariate normal distribution with covariance matrix il = (or-j) given by (3). It will be convenient to subtract its mean from every observation. The resulting adjusted observations are denoted here by x = (x1, . . ., Xn). They have thus an associated mean vector e = 0. Using basic facts for conditional distributions, one can write the likelihood as (23) f(x) =f(x1, . . . ,xn1Ixn)f(xn) wheref(xl, xnllXn) =f(xl*, . . ., xnl*) represents the normal distribution of n -1 transformed values xi* which are independent of xn [20]. With Yin = Cinl(nn denoting the regression coefficient of xi on Xn (Ann = variance of xn), the transformations xi xi*, i = 1, ... ,- 1 are given by (24) xi* = Xi - Yirn * The new values xi*, modified by Xn, all have again zero mean but new variances and covariances given by (25) City = -i- (Tin 3Jjn / ann = -i- Yin ojnThe next cycle in the recursive procedure follows.

Xn-l*) =f(Xl*. * X.-2*in-1*)fXn-1*) .(l * -

f(x) =f(xl**,

.

...xn2**)f(ni*)f(xn)

(26)

where the xi** are a set of n-2 newly transformed values. The necessary transformations of the observations and modifications to the covariances are analogous to (24) and (25). The process continues in the same fashion until the last set of transformed , xn). Ordering the observations consists of a single value corresponding tof(x I1x2, . pedigree members such that the last elementsxn, xn-1, . . . correspond to the originals (married into the pedigree) accelerates the process because these are independent of their spouses and spouses' ancestors. This is preferable to. the "natural" ordering in which parents precede their children, since the observations of two independent parents are dependent conditional on their children's values. The univariate densities of transformed values in (26) can easily be calculated as they have mean zero and their variances have been computed in the recursive process. This process is based on the representation

fix)

h-1

=

Ti= f(xixi +1

*xn)f(xn).

. . .

(27)

It is clearly a type of orthogonalization process, the variances of the univariate densities in (27) corresponding to the eigenvalues of fQ. Members of a sibship share variances and covariances among themselves whether or not they have previously been transformed, so that the inverse and the determinant of

168

OTT that partition of Q corresponding to a sibship can be obtained analytically [20]. Consequently, the values of siblings can be treated as a single block so that for a sibship, only a single cycle in the recursive procedure is necessary. Consider a vector x = (x,.x2)t of observations, all with zero mean, which may already have been modified in previous cycles of the recursive procedure. Let the covariance matrix fl be partitioned so that fll, refers to the vector xl of k siblings and contains diagonal elements or,, and off-diagonal elements all equal to Cr12. The observations x2 can be transformed into x2* through regression on x1 by -2* = -x2 - Q21 [I - (0r12/S) J]X1/(o-11 - cr12) where s = o(r + (k- 1) 0c12, I being the k x k identity matrix, and J being a matrix consisting entirely of ones. The determinant of ill is needed in the computation off(xl) and is given by IfI11 = (-1 - a12)k-ls. 3.3 Goodness of Fit Test Under model (2), the observations in a pedigree follow a multivariate normal distribution with a rather small number of unknown parameters. Once these have been estimated by ML, the observations x can be orthogonally transformed into mutually independent variables. The recursive procedure described in the last section terminates with independent values which, after division by the square root of their variances, constitute a random sample from the standard normal distribution. A histogram with k classes can now be formed from these values, the expected class frequencies being computed from the normal distribution. The goodness of fit test consists then of computing X2 = Y. (0i - E )2/E, where 0i and E, denote the respective observed and expected class frequencies. Under model (2), for large numbers of observations, X2 follows a chi-square distribution with the degrees of freedom being equal to k- I minus the number of estimated parameters. 4. MAJOR GENE AND POLYGENIC BACKGROUND

Consider an autosomal locus with two alleles, B and b, with respective gene frequencies p and I-p. It is assumed that each of the three genotypes BB, Bb and bb makes a specific contribution to the phenotype. If this genotypic effect is labeled gi for the ith individual, then the mixed model of inheritance for an observed value xi can be specified as

(28) gi + ai + ci + ei where a1, ci and ei are the random effects considered previously, and gi is an unknown fixed effect. With ci = 0, model (28) was already considered in [6]. Each gi can be equal to any of the three means ,u,, /12 or /3 corresponding to the genotypes BB, Bb and bb around which there is random polygenic and environmental variation. An additional term u in (28) for a general mean is not needed, as it can be absorbed in gi. In vector notation, model (28) reads x = g + a + c + e. Because of the term g, x does not follow a multivariate normal distribution but rather, the likelihood is a mixture of such xi

=

distributions,

f(x) = If( Ig) P(g),

(29)

where the (multiple) sum is over all possible genotype arrangements in the pedigree,

COUNTING METHODS UNDER A MIXED MODEL 169 and the f(xjg) are multivariate normal densities with mean vector g and the same covariance matrix fl (3) for each of thef(xtg). Representation (29) has been used before for single gene models in linkage and segregation analysis [10]. The mixing proportions P(g) in (29) are determined by p and the pedigree structure. Model (28) thus contains 7 parameters: 3 means, 3 variance components and one gene frequency. Morton et al. [ 16] used likelihood methods for nuclear families under model (28) with a slightly different parametization. They computed the likelihood of a sibship conditional on the parental phenotypes. 4.1 Counting Equations To estimate the gene frequencyp, consider the probability P (g) of a genotype vector in (29) which can be written as P(g) = [IP(g) HP(gil)

(30)

where the first product is over allm individuals marrying into the pedigree (originals). Under Hardy-Weinberg equilibrium, P(gj) is determined by the gene frequencies. The second product is over all other individuals in the pedigree. It contains the probabilities of the genotype of the ith individual given the parental genotypes and does not contain p. Under random mating,

P(g1) = I

p2

ifgi =A

(I -p) if gi = A2 (l-p)2 if gi = A3. Let, for given g, mi denote the number of those originals with mean tki, ml +m2 +m3 being equal to m. Clearly, mi will generally depend on g and will change from one to another genotype vector. For ease of notation, no subscripts g are given in any formula below. With this, one can write Hl 1 P(g1) =rp2mn[2p(1-]rM2 (1 -p)m3 so that the likelihood becomes m L =f(x) = 2?f2p2ml1+1l2 (1 -p)n2+2f3lJf(xlg) H P(g )

The partial derivative of L with respect to p can be computed as L'= (2m1+m2-2mp)ft(xlg) P (g)/[p (1 -p)]. Setting L' equal to zero leads to

p = E (2m 1 +m2)f(xlg) P (g)/(2mL) sincef(x1g) P(g)/L = P(g x). The alternate torm of (31) is easily obtained as p =p

=

E(2m 1 +m22x)I(2m)

+Z'p(l-p)I(2m).

(31)

(32)

This same formula applies to segregation and linkage analysis treated previously [10] under the assumption of fixed p. A given arrangement g assigns a value g = u1, /2 or /3 to the ith individual. Assume /L1 < /L2 < ,U3. To find the ML estimate for one of the means (e.g., Al) imagine, for given g, all individuals ordered by increasing value of gi so that the individuals with mean tul are the first n, of the n individuals in the pedigree, while the

OTT 170 remaining n -n1 observations have mean vector Ad with elements A or /3. Let the Q matrix and the vector x of observations be partitioned accordingly as it was done in section 3. 1. 1 so that fl has dimension niIx n1 . Clearly, n 1 as well as fl again depend on g. The derivative of L with respect to p can be calculated as L' = .[!t ll(xl Al1)+It f (x2 - 2)] P(X1g) P(g) which is analogous to (11) where all individuals were considered to have the same genotype. Setting L' equal to zero leads to

Al= h[l fll X + jt (12(2- /2)] P(Xlg) P(g)Y(1t 11j1) P(xlg) P(g) (33) The alternate form of (33) can be found as Al = I1 + Z'/1(1t

fll, i)P(glx)

(34) where the denominator equals E(!t WI Ix). In practice, this can be approximated by nl'/r2 where ni' is an estimate for the expected number of individuals assigned to the distribution with mean ,l. The counting equations for A2 and A3 are analogous to (34). For restrictions such as ,l = p,2 (recessive case) or 2 = A3 (dominant case), the counting equations have to be modified appropriately. For the estimation of a,2 the derivations are similar to the ones in the polygenic case. The likelihood can be written as (35) L = Yff(xjag)f(a)daP(g)

which is analogous to (18). Taking the derivative of L with respect to -,a2 and equating it to zero leads to ora2 = E(at A-lalx)/n whose analytical expression is not suitable for a practical use and is not given here. The alternate form of the counting equation for Orq2 turns out to be identical to (21), the difference between the two models thus being completely absorbed in the likelihood and its derivative. The same holds true for oc2 and cre2. 4.2 Likelihood Computations Three methods of likelihood computations and approximations to it will be discussed for the mixed model: (1) enumeration of all genotype vectors with probability P(g) > 0, (2) random sampling of genotype vectors, and (3) numerical integration over breeding values. The first two methods are applicable to general pedigrees while the last one is restricted to simple pedigrees. Methods 1 and 2 also apply to linkage and segregation analysis [10] as far as genotype vectors are concerned. 4.2.1 Enumeration ofgenotype vectors. With n individuals in a pedigree each with three possible genotypes, the total of all genotype assignments equals 3f. Since many of these will be incompatible with the existing pedigree structure, the number of genotype vectors with P(g) > 0 is much smaller but still large enough so that at present 10) can be so analyzed. It might be possible to analyze only small pedigrees (n larger pedigrees in special cases (e.g., when a gene is rare in the population having entered a pedigree through one individual, while all other originals have only one possible genotype). As an example, consider a nuclear family (i.e., a pair of parents with n-2 offspring). For these individuals, the number of genotype vectors with positive probability is given by 4+2n+3n-2 which is much smaller than 3n. Once a list of genotype vectors with associated probabilities given by (30) is compiled, the =

171 COUNTING METHODS UNDER A MIXED MODEL likelihood is easily computed by carrying out the summation in (29) over all these genotype vectors. This is an extension of the complex segregation analysis in nuclear families [16]. Note that here, no numerical integrations are required, and the method can be applied to pedigrees of arbitrary structure. Since n is necessarily small, f(xg) can be evaluated analytically. The changing genotype vectors affect only the means, not the covariance matrix. The computer program COMLIK was written by Richard Lee to carry out the computations and is available from the author. The first partial derivatives are approximated numerically in COMLIK. As an example, the IgE levels in nuclear families analyzed by Gerrard et al. [21] were partially reanalyzed by the method described above. One family with 10 offspring was deleted (12 out of 781 individuals) because it alone would require the consideration of 63,149 genotype vectors. For all other families together, 11,990 genotype vectors have positive probability of occurrence. The data were used after transformation by formula (1) in Gerrard et al. [21] with r = 6 and p = -.639. The seven parameters used here can be expressed in terms of the seven parameters (V, U, H, B, q, t, d) in Gerrard et al. [25] by p= I1-q, g, = U-q't -2q(I-q td, I,2=dt+,ul, Iu3=t+111, 0 a± 2c2 +e2 =V-(p/Li)22p(1 p)U22-(l-p)2 32+U2, 0 2=H22 2 =Bo2. Several lines of table 7 in Gerrard et al. [21] were recomputed by COMLIK and the POLYPARM program of Lange et al. [13] which gives results identical to those of COMLIK when the same hypotheses are tested. Table 1 shows the comparisons between the three methods for some of the hypotheses considered by Gerrard et al. [21]. According to their table 7, the general hypothesis has the same likelihood as that with the restriction d=B = (U1 =,2; 0>c2=0). It was therefore convenient to compare this hypothesis with q =t =d =B =0 (011 =U2 = /.3; Oc2=0; p = I) for which the computations were carried out by POLYPARM. The chi-square for this comparison was 3.36 (2 df. ) which is, in contrast to the results of Gerrard et al. [21], not formally significant. The starting values used for the iterations by COMLIK were those reported by Gerrard et al. [21] for d=B=O. The maximum likelihood was reached by COMLIK in about 50 iterations (cost for computer time: $300.). For the polygenic hypothesis q =t=d =0, the results in Gerrard et al. [21] were compared with those obtained by the POLYPARM

TABLE 1 PARTIAL REANALYSIS OF DATA IN GERRARD ET AL. [21] Hypothesis

d=B=O

Method

p

A1=14

1sa3

M

.45

-.56

.84

.10

.42

.6172

C

.38

-.57 -.13 -.13

.59 = l =

.22 .47 .44

4.0522 2.3747

.01

.40 .47 .46

-.13

=

.36

.07

.52

2.8554

&S

a

in L + 327

'

i..................

q=t=d=B=0 .P .................

q=t=d=0/ .

M

P

.

...

...

NOTE.-M = method of Morton and MacLean [16], C = COMLIK, P = POLYPARM. The hypotheses are given in terms of the parameters in [21]. Ln L = log likelihood as computed by COMLIK or POLYPARM.

172

OTT

program (last two lines of table 1). The agreement between them is not perfect (also noted in the comparison of lines 1 and 2 for COMLIK). The differences must be due to the fact that in Gerrard et al. [21], the likelihood is computed conditional on the parental phenotypes, while in COMLIK and POLYPARM, unconditional likelihoods are computed. Also, one large family (1.5% of all individuals) was missing in this analysis. 4.2.2 Random sampling of genotypes. In many areas of technology and business, when the number of items to be surveyed is so large that a complete listing is not practicable, the method of taking a random sample for an approximate result is being used successfully [22]. As proposed earlier [9], the same strategy could be applied to genotype vectors. This approach has the disadvantage that the results will randomly fluctuate around the correct ML estimates, but there is also the advantage that the investigator can determine the precision of the results and thus their cost in terms of computer time by varying the sample size. To randomly sample an element (a genotype vector g) from P(g;p) with given gene frequency p, one proceeds as follows. To each original in the pedigree, a genotype is randomly assigned by the usual methods [i.e., a uniform random number u is drawn which determines the genotype as k (k = 1 forBB, 2 forlBb, 3 for bb) by qk-l C u < qk where qk is the cumulative sum of genotype frequencies up to and including the kth genotype]. This can be done very quickly [23]. The given genotypes of two parents then determine the probabilities for the genotypes of their offspring by the Mendelian laws. For each child, a genotype is then selected in an analogous way until the whole pedigree is exhausted. To sample another genotype vector, this process is repeated. If the gene frequency is small, often none of the originals will be assigned a genotype with the rare gene. In such ill-conditioned situations, it may be advantageous to use a device communicated to me by Dr. K. Lange. The probability of genotype vectors is written asP(g; p) = c P(g;0.5) where c = P(g;p)IP(g;O. 5). One then samples from P(g;0.5) rather than from P(g; p) and, for each genotype vector obtained, computes the correction factor c to be applied to P(g;0.5). The scheme described for sampling genotype vectors is based on the representation (30) of P(g). It generates genotype vectors in a "natural" way and independently of the observed phenotypes. With qualitative phenotypes, especially when little phenotypic information is given on the originals, most genotype vectors sampled will be incompatible with the phenotypes [i.e., P(xlg) = 01. In principle, this does not occur for quantitative phenotypes. When phenotype-genotype incompatibility is a problem, the situation can be alleviated (although not completely overcome) by using a representation of P(g) other than the "natural" one (i.e., by sometimes going "upwards" in a pedigree). For example, consider two parents (individuals 1 and 2) and their child (individual 3). Then,

(36) P(g) =P(g1) P(g2) P(g3g12) which is the basis for the "natural" generation of genotypes. But P(g) can also be written as (37) P(gT = P93) P(943) P@gI fl3) where P(g3) is given by the population gene frequencies, and the genotypes are

COUNTING METHODS UNDER A MIXED MODEL

173

generated in the order of individuals 3-2-1 rather than 1-2-3. If the phenotypic information rules out some genotypes of individual 3 but not of individuals 1 and 2, then representation (37) is more economical than (36) because incompatible genotypes can be prevented at the first stage of random sampling. This will be especially beneficial for larger pedigrees. At least one individual in a pedigree can be chosen arbitrarily with which to start the generation of a genotype vector. One is not bound to generate genotype vectors from the "top" of a pedigree "downwards" but in any direction. With quantitative phenotypes, sampling directly from P(glx) rather than from P(g) should prevent the occurrence of genotype vectors for whichf(xIg) is very small. Assume now that a number n, of genotype vectors is being randomly generated, and denote the jth member of this random sample by gJ, j = 1,.n . , n,. Rather than keeping track of the generated vectors in order to produce a histogram that should approximate P(g) in (29), consider each vector generated as a new event. By virtue of random sampling, the n, vectors are independent and have the same probability of occurrence [i.e., P(gi) = l/nJ]. All (multiple) sums over the genotype vectors can thus be approximated by a sum over the sampled vectors. For example, the likelihood (29) is approximated by S = (1/n,) EIf(Ilgj). By the law of large numbers, S -*f(x) as n, -a oo. 4.2.3 Numerical integrations. Elston and Stewart [6] described a fast procedure for the recursive computation of the likelihood in simple pedigrees under a single gene model [i.e., model (28) with 0-a2 = 0>r2 = 0]. It is based on the conditional independence of the phenotypes given the genotypes at the single locus. This conditional independence no longer holds when allowancing for polygenic effects. Under model (28) with ci = 0, a possible solution is to also condition on the additive polygenic effects ai so the likelihood can be written as in (35). Then, the phenotypes are again conditionally independent. The multivariate normal densityf(a) of the additive polygenic effects can be written as Hf(ai) fIf(a i ) in complete analogy to P (g) in (30). For the actual calculation of (35), the densitiesf(ai) andf(ail ) have to be approximated by histograms so that the multiple integral in (35) can be replaced by a multiple sum. The effects gi + ai can thus be considered as genotypic in a broader sense, and for simple pedigrees, the likelihood can be computed approximately in a recursive fashion similar to the algorithm by Elston and Stewart [6]. 4.3 Goodness of Fit Tests With ML estimates of the parameters in the mixed model (28), a proper likelihood ratio test of the polygenic hypothesis, Ho0p=0, versus the major gene hypothesis, HI1:p>0, can be done. If Zi, i = 0,1, are the maximized log, likelihoods under the hypotheses Hi then, for large numbers of observations, x2 = 2(Z1-Z0) follows a chi-square distribution with the appropriate degrees of freedom. However, if Ho cannot be rejected, this does not imply a good fit of the data to Ho as both, Ho and HI, may fit the data equally badly. A goodness of fit test for H0 is described in section 3.3. For the mixed model, a goodness of fit test is difficult to derive asf(x) is given by (29) is a complicated density. The best solution seems a computer-implemented test by Monte Carlo methods. The rationale is as follows. A (univariate) statistic has to be defined such as the sample mean or median which should best characterize the

174

OTT

distribution of observed data. Such a simple statistic cannot be interpreted in the usual way for dependent observations (related individuals) and has no meaning by itself, but it can still serve as a useful characterization of the data. Under the model whose fit is to be tested, with parameter values as given by their maximum likelihood estimates, one now randomly generates one observation for each of the pedigree members studied and repeats this n, times. The same statistic is then computed for each of the n, sets of randomly generated observations so that the density of this statistic under the assumed model can be approximated by a histogram. Then, for an approximate test with 95% confidence (5% level), one checks if the statistic computed from the actually observed measurements lies within the 21/2% largest and 21/2% smallest generated values of the statistic. If yes, a good fit of the data can be assumed as judged by the statistic chosen. The random generation of an x-vector from (29) is easily accomplished by generating, for each pedigree, a vector of means ,u associated with a genotype vector g as described in 4.2.2 and subsequent generation of a multivariate normal data point from N(p,fl). Other presumably useful statistics are the sample variance, skewness and kurtosis computed from x disregarding the dependencies among the elements of x. 5. DISCUSSION

The counting equations derived in this paper refer to observations on a single pedigree. If for several pedigrees, Zj denotes the log likelihood of thejth pedigree, then the log likelihood of all observations is given by Z = L Zj where the sum is over all pedigrees. With this, the counting equations can easily be derived for several pedigrees. They are generally more complicated than for a single pedigree. However, the approximate alternate forms are the same no matter how many pedigrees are considered. They are thus the formulas of choice. These alternate forms also have theoretical advantages over the regular counting equations. Let P i+ = 0(pi) be such an alternate form where Pi stands for the value of any parameter p in the ith iteration. Clearly, for the ML estimate j, the equationp = +(p) holds. The convergence is linear [11] so that the convergence factor is given by +'(i) [24] which can easily be computed from the alternate forms (note that Z' = 0 at pj). It can also be estimated in the course of the iterations by (pi+I -pi)(pi-Pi-1). As an example, if the unknown parameter p denotes the gene frequency in a mixed model, then 0(p) = p + Z'p(l -p)/(2m) by formula (32). The convergence factor can be = 1 + Z"i(l-i)/(2m) with 0 < +'(p) < 1 since the second computed as I'(p) -1lZ", derivative of the log likelihood, Z", is negative at f and the variance of j O, cannot be smaller than '(l-k)I(2m). As another example, consider Z I 2In' = +Z' (38)

from formula (34) for which 1 +Z("&2/n'. I+= (39) This also shows how the value of n' can be used to modify the rate of convergence which is inversely related to the convergence factor. An increase in n' slows convergence down [i.e., increases the value of f'(,I)]. On the other hand, if n' is made so small that 4'(,) equals zero (i.e., n' = -Z"&2), then using this in the counting

COUNTING METHODS UNDER A MIXED MODEL

175

equation (38) leads to 4(ji.) = u - Z'Z" which is, not unexpectedly, the NewtonRaphson formula for a single parameter. In practice, -17" can be determined numerically in the course of the iterations. Setting this equal to oVn' for ,u [or to p(l -p)I(2m) for p, etc.] determines a lower bound for the value of n ' (or m) to be chosen for accelerating the iteration process. REFERENCES 1. MITTMANN 0: Vererbung durch ein Genpaar und Mitwirkung des Restgenotyps im statistischen Nachweis. Z indukt Abst Vererbungslehre 75:191-232, 1938 2. HARRuS H: Enzyme polymorphisms in man. Proc R Soc (Lond, Ser. B) 164:298-310, 1966 3. PENROSE LS: Effects of additive genes at many loci compared with those of a set of alleles at one locus in parent-child and sib correlation. Ann Hum Genet 33:15-21, 1969 4. BIRNBAUM A: The random phenotype concept, with applications. Genetics 72:739-758, 1972 5. WEISS V: Ein Algorithmus fur die Mendelanalyse quantitativer Merkmale. Biol Zentralblatt 93:311-323, 1974 6. ELSTON RC, STEWART J: A general model for the genetic analysis of pedigree data. Hum Hered 21:523-543, 1971 7. HARTLEY HO, RAO JNK: Maximum-likelihood estimation for the mixed analysis of variance model. Biometrika 54:93- 108, 1967 8. CEPPELLINI R, SINISCALCO M, SMITH CAB: The estimation of gene frequencies in a random mating population. Ann Hum Genet 20:97-115, 1955 9. OTT J: Computer simulation in human linkage analysis. Am J Hum Genet 26:64A, 1974 10. OTT J: Counting methods (EM algorithm) in human pedigree analysis. Linkage and segregation analysis. Ann Hum Genet 40:443-454, 1977 11. DEMPSTER AP, LAIRD NM, RUBIN DB: Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc (Lond, Ser. B) 39:1- 22, 1977 12. OTT J: Linkage analysis with misclassification at one locus. Clin Genet 12:119-124, 1977 13. LANGE K, WESTLAKE J, SPENCE MA: Extensions to pedigree analysis. HI. Variance components by the scoring method. Ann Hum Genet 39:485-492, 1976 14. HENDERSON CR: A simple method for computing the inverse of a numerator relationship matrix used in prediction of breeding values. Biometrics 32:69-83, 1976 15. QUAAS RL: Computing the diagonal elements and inverse of a large numerator relationship matrix. Biometrics 32:949-953, 1976 16. MORTON NE, MACLEAN CJ: Analysis of family resemblance. III. Complex segregation of quantitative traits. Am J Hum Genet 26:489-503, 1974 17. RAO DC, MORTON NE, YEE S: Analysis of family resemblance. II. A linear model for familial correlation. Am JHum Genet 26:331-359, 1974 18. RAO DC, MORTON NE, YEE S: Resolution of cultural and biological inheritance by path analysis. Am J Hum Genet 28:228-242, 1976 19. THOMPSON R: The estimation of heritability with unbalanced data. Biometrics 33:485-495, 1977 20. RAO CR: Linear Statistical Inference and its Applications. New York, J. Wiley, 1973 21. GERRARD JW, RAO DC, MORTON NE: A genetic study of Immunoglobulin E. Am J Hum Genet 30:46-58, 1978 22. SLONIM MJ: Sampling. New York, Simon and Schuster, 1967 23. MARSAGLIA G: Generating discrete random variables in a computer. Commun Assoc Comput Mach 6:37-38, 1963 24. IsAAcsoN E, KELLER HB: Analysis of Numerical Methods. New York, J. Wiley, 1966

Maximum likelihood estimation by counting methods under polygenic and mixed models in human pedigrees.

Am JHum Genet 31:161-175, 1979 Maximum Likelihood Estimation by Counting Methods Under Polygenic and Mixed Models in Human Pedigrees JURG OTT SUMMAR...
2MB Sizes 0 Downloads 0 Views