The Power of Identity-by-State Methods for Linkage Analysis

D. Timothy Bishop* and John A. Williamson†

*Imperial Cancer Research Fund, Leeds; and †Department of Mathematics, University of Colorado, Boulder

Summary

The affected-sib-pair method has been widely utilized for mapping. This methodology is aimed at mapping complex traits which have been observed to be familial but for which Mendelian segregation, even after allowing for partial penetrance, is not apparent. Indications of linkage are based on the observation of nonrandom segregation at a marker locus in two affected siblings. We extend this methodology to more distant genetic relationships and examine the power of identity-by-state methods for mapping when marker information is only available on pairs of affected relatives. The power depends on the polymorphism of the marker, the probability of identity by descent at the trait locus, and the recombination fraction between the trait and the marker loci.

Introduction

The availability of a large number of DNA markers has made mapping projects possible, with the certainty that if (a) a major gene exists for a trait, (b) the trait is reasonably homogeneous, and (c) sufficient family material is available, then a linked marker can be found. While mapping of traits with a known genetic susceptibility and mode of inheritance is preferable, the geneticist will soon be able to test the hypothesis of inherited susceptibility without making any assumptions about the mode of inheritance, by examining the segregation of DNA markers to affected individuals. If an abnormal segregation is detected, then this supports an inherited susceptibility. With the completion of the genomic map (Botstein et al. 1980), all regions of the genome can be explored to identify important DNA segments or refute significant genetic factors.

The sib-pair approach was suggested by Penrose (1953) and is based on a comparison of the observed and expected identity-by-descent (i.b.d.) distributions at the marker locus. Modifications of this method have subsequently been applied, especially for HLA-related traits (Day and Simons 1976; Haseman and Elston 1972; Smith 1953; Thomson 1986). Extensions have included examination of the i.b.d. distributions under two-locus models (Hodge 1981) and more distant relationships (Cantor and Rotter 1987). The power of this method for sib pairs has been investigated by Suarez et al. (1978), Blackwelder and Elston (1985), and Risch (1990a).

Recently, Lange (1986a, 1986b) recognized the flexibility of identity-by-state (i.b.s.) methods and their importance for markers which are not totally informative. He obtained the i.b.s. distribution for siblings. We extend these methods by considering i.b.s. distributions for more distant degrees of relationship. We also examine the power of a $\chi^2$ goodness-of-fit test when susceptibility to the trait is inherited, since this will be the major application of these methods. We assume that information is only available on a pair of affected relatives in each independent pedigree. The test based on multiple affected relatives in pedigrees that has been devised by Weeks and Lange (1988), though more widely applicable than the test we introduce, is not a test on which power calculations can readily be made. With our simpler test, we are able to identify the precise way that various factors determine the linkage information obtainable from these sampling units. These factors include the polymorphism of the marker, the distance between the loci, the mode of inheritance of the trait, and the genetic relationship of the affected pair. The sampling unit of pairs of affected relatives represents the smallest unit useful for linkage, although clearly in many cases extra information will be preferable (and indeed available) to make mapping more feasible or likely. Readers should also consult Risch (1990a, 1990b, 1990c) for studies similar to those presented here and for various extensions.

Received May 23, 1989; revision received October 9, 1989. Address for correspondence and reprints: Tim Bishop, Genetic Epidemiology Laboratory, Imperial Cancer Research Fund, 3K Springfield House, Hyde Terrace, Leeds LS2 9LU, England. © 1990 by The American Society of Human Genetics. All rights reserved. 0002-9297/90/4602-0005$02.00

Methods

We consider a pair of relatives of known genetic relationship, both of whom are affected with the trait of interest. For both of these individuals, we have information on their phenotype at the marker locus, but no further marker information on the family is available. The marker locus is assumed to have $n$ codominant alleles $A_1, A_2, \ldots, A_n$, with known population frequencies $p_1, p_2, \ldots, p_n$. Finally, we assume that there is no epistatic relationship between the marker locus and the trait; that is, we assume that there is linkage equilibrium between the marker and the trait and that the marker alleles play no role in determining trait expression.

We examine the pairs of affected relatives for how their phenotypes at the marker locus deviate from what is expected under the assumption that the trait and the marker locus segregate independently: the null hypothesis. Let $X$ be the number of marker alleles shared in common by the two affected individuals of a pair. We adopt the following conventions: if the individuals are $(A_1A_2)$ and $(A_3A_4)$, then $\{X=0\}$; $(A_1A_2)$ and $(A_1A_3)$ is an example of $\{X=1\}$; and $(A_1A_2)$ and $(A_1A_2)$ is an example of $\{X=2\}$. Further, when both individuals are the same homozygote, we consider that $\{X=2\}$, while $(A_1A_1)$ and $(A_1A_2)$ gives $\{X=1\}$.

Any alleles in common may be either the result of i.b.d. or of separate copies of the same allele. Since these two situations are indistinguishable phenotypically, the distribution of $X$ under the null hypothesis is a mixture of two or three component distributions, depending on the genetic relationship of the pair. The first component is the distribution of $X$ conditioned on there being no alleles i.b.d. at the marker locus, the second on a single allele i.b.d., and the third on two alleles i.b.d. The resulting probability distribution of $X$ is given by

$$
\begin{aligned}
P[X=0] &= k_0 T_{00},\\
P[X=1] &= k_0 T_{10} + k_1 T_{11},\\
P[X=2] &= k_0 T_{20} + k_1 T_{21} + k_2 T_{22},
\end{aligned} \tag{1}
$$

where $k_i$ is the probability that the two individuals have $i$ genes i.b.d. at the marker locus for $i$ = 0, 1, and 2; these are the familiar Cotterman k-coefficients (Cotterman 1940), except for $k_1$, which is twice the usual value. $T_{ij}$ ($i$ = 0,1,2 and $j$ = 0,1,2) is the probability of $i$ alleles i.b.s. when $j$ alleles are i.b.d. at the marker locus. In terms of the allele frequencies, when Hardy-Weinberg equilibrium is assumed, these are given by

$$
\begin{aligned}
T_{00} &= \sum_{i \ne j} p_i p_j (1 - p_i - p_j)^2 + \sum_i p_i^2 (1 - p_i)^2,\\
T_{10} &= 4 \sum_{i \ne j} p_i p_j^2 (1 - p_i - p_j) + 4 \sum_i p_i^3 (1 - p_i),\\
T_{20} &= 2 \sum_{i \ne j} p_i^2 p_j^2 + \sum_i p_i^4,\\
T_{11} &= \sum_i p_i (1 - p_i),\\
T_{21} &= \sum_i p_i^2,\\
T_{22} &= 1.
\end{aligned} \tag{2}
$$

We note that $T_{00} + T_{10} + T_{20} = 1$ and also that $T_{11} + T_{21} = 1$. The formulas (2) are obtained by summing probabilities over all possible events, and $i$ and $j$ take all values from 1 to $n$, with $i \ne j$ in the double sums. For instance, $T_{00}$ is the probability that two individuals who do not share an allele by descent also do not have an allele i.b.s.; this is equivalent to the probability that two unrelated individuals do not share a copy of the same allele. The first term in the calculation of $T_{00}$ corresponds to the first individual of the affected pair being a heterozygote $(A_iA_j)$ and the second individual having no alleles in common with the first individual; this latter event has probability $(1 - p_i - p_j)^2$. The second term corresponds to the first individual being a homozygote $(A_iA_i)$ for an allele and the second individual not carrying that allele.

The vector $(k_0, k_1, k_2)$ is defined by the genetic relationship of the individuals. We will use the term "unilineal" to denote those genetic relationships in which the two individuals can share at most one allele i.b.d. at each locus. For these relationships, $k_2 = 0$ and $k_1$ is equal to four times their coefficient of kinship ($\phi$), since the coefficient of kinship is the probability that a gene chosen at random from the first individual is i.b.d. with a gene chosen at random from the second; $k_0 = 1 - k_1$. The only bilineal relationship that we will consider is full sibs. For full sibs, in the absence of inbreeding, $k_0 = 1/4$, $k_1 = 1/2$, and $k_2 = 1/4$. Examination of equation (1) for full sibs gives the formula as provided by Lange (1986a).

To produce a statistical test of the null hypothesis that there is no relationship between the trait and the marker locus, we compare the observed values of $X$ with the theoretical distribution of $X$ under the null hypothesis. We consider as a statistic the Pearson $\chi^2$ statistic for testing the goodness of fit between the observed and expected distributions of $X$. In some situations, a more powerful test is obtained by considering a likelihood ratio test using the actual distribution function for the observed marker genotypes. (We will develop this test more fully in a future article.)

Expected Information Content and Power

The power of a statistical test is the probability of rejecting the null hypothesis as a function of the specific alternative hypothesis. It is computed from the distribution of the test statistic under that specific alternative hypothesis. Since the major reason for performing the tests described above is to attempt to locate a marker locus linked to the trait locus, we consider the power of these tests under these conditions. For our calculations, we assume that trait susceptibility is determined by a major locus, although other unspecified environmental factors may (and usually will) contribute to penetrance.

To analyze these tests, we must obtain the distribution of marker phenotypes among the two affected relatives, under both the null and alternative hypotheses. We assume that the recombination fraction between the trait susceptibility locus and the marker locus is $r$. For $r = .5$ (corresponding to the case of no linkage), the distribution of $X$ is given by formulas (1) and (2). When the same conditioning argument as employed to obtain formula (1) is used, the distribution of $X$ under the alternative hypothesis is obtained by redefining $k_0$, $k_1$, and $k_2$ to be, respectively, the probability of $i$ alleles i.b.d. at the marker locus ($i$ = 0,1,2), given the relationship of the two affected individuals and the trait mode of inheritance; the three probabilities $k_0$, $k_1$, and $k_2$ are now functions of the recombination fraction $r$ and will be more appropriately written as $k_0(r)$, $k_1(r)$, and $k_2(r)$, respectively. In the Appendix we show that the distribution of $X$ under both null and alternative hypotheses is given by

$$
\begin{aligned}
P[X=0 \mid r] &= k_0(r) T_{00},\\
P[X=1 \mid r] &= k_0(r) T_{10} + k_1(r) T_{11},\\
P[X=2 \mid r] &= k_0(r) T_{20} + k_1(r) T_{21} + k_2(r) T_{22},
\end{aligned} \tag{3}
$$

where $k_i(r)$ is calculated from the mode of inheritance of the trait, the genetic relationship of the affected pair, and the recombination fraction $r$. When $r = .5$, the $k_i(.5)$ are again the coefficients defined in equations (1). The $T_{ij}$ are exactly as defined in equations (2).
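As a rough numerical companion to equations (1)-(3), the following Python sketch evaluates the $T_{ij}$ from a vector of marker allele frequencies and mixes them with a supplied vector $(k_0, k_1, k_2)$. The function names `ibs_given_ibd` and `ibs_distribution` are illustrative choices, not part of the original analysis, and the double sums are taken over ordered pairs with $i \ne j$, which is an assumption about how the sums in equations (2) are indexed.

```python
from itertools import permutations


def ibs_given_ibd(p):
    """Return T with T[i][j] = P(i alleles i.b.s. | j alleles i.b.d.).

    p is a list of marker allele frequencies summing to 1; Hardy-Weinberg
    equilibrium is assumed, as in equations (2).
    """
    # Ordered pairs (i, j) with i != j (assumed indexing of the double sums).
    pairs = list(permutations(range(len(p)), 2))
    t00 = (sum(p[i] * p[j] * (1 - p[i] - p[j]) ** 2 for i, j in pairs)
           + sum(a ** 2 * (1 - a) ** 2 for a in p))
    t10 = (4 * sum(p[i] * p[j] ** 2 * (1 - p[i] - p[j]) for i, j in pairs)
           + 4 * sum(a ** 3 * (1 - a) for a in p))
    t20 = (2 * sum(p[i] ** 2 * p[j] ** 2 for i, j in pairs)
           + sum(a ** 4 for a in p))
    t11 = sum(a * (1 - a) for a in p)
    t21 = sum(a ** 2 for a in p)
    # Entries with i < j are zero: alleles i.b.d. are necessarily i.b.s.
    return [[t00, 0.0, 0.0],
            [t10, t11, 0.0],
            [t20, t21, 1.0]]


def ibs_distribution(k, p):
    """P[X = i], i = 0, 1, 2, for i.b.d. probabilities k = (k0, k1, k2)."""
    T = ibs_given_ibd(p)
    return [sum(T[i][j] * k[j] for j in range(3)) for i in range(3)]


if __name__ == "__main__":
    # Full sibs under the null hypothesis, four equally frequent alleles.
    print(ibs_distribution((0.25, 0.5, 0.25), [0.25] * 4))
```

Passing the Cotterman coefficients gives the null distribution of $X$; passing $k_i(r)$ for $r < .5$ gives the distribution under linkage, so the same function serves both hypotheses.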

Deviations from the hypothesized distribution (i.e., with $r = .5$) among affected pairs of individuals suggest that there is some relationship between the trait and the marker locus. We shall pay particular attention to the situation in which the trait is determined by a major locus which is linked to the marker locus. Deviations in the distribution of $X$ induced by linkage should then be reflected in excess sharing of alleles among affected pairs. While, under linkage, the number of pairs of $\{X=2\}$ will always tend to be greater than expected and the number of pairs of $\{X=0\}$ will tend to be less, the expected deviation in $\{X=1\}$ is not uniform in direction and depends on the allele frequencies at the marker locus. When the allele frequencies are approximately equal and there are three or more alleles, the number of pairs for which $\{X=1\}$ will tend to be greater than expected when there is no linkage; but in situations where there are three or more alleles and widely differing allele frequencies, the number of pairs for which $\{X=1\}$ will tend to be less than expected. For two alleles, the number of pairs for which $\{X=1\}$ will tend to be less than expected (see the last section of the Appendix, Combining Cells in the $\chi^2$ Test).

We examine the information available to identify linkage by using two measures. The first is the power for a given alternative hypothesis. However, since for large sample sizes the power approaches unity in all situations, we introduce a second measure, based on Kullback-Leibler information (Kullback and Leibler 1951). If we were to use a sequential likelihood ratio test to test $H_0$: $r = .5$ against $H_1$: $r = r_1$ (with $r_1$ being less than .5), we would continue to sample, stopping and making our decision when and if the sum of the logarithms of the likelihood ratios achieves one of two boundary values. The expected sample size needed to attain the boundary is inversely proportional to the Kullback-Leibler information function, henceforth labeled ELOD (for expected LOD score). For this reason, choosing the test with the largest ELOD (i.e., with the largest expected log-likelihood difference) would seem appealing. However, this is not the same as choosing the test with the most power, though, in most cases, the test with the largest ELOD also has, for a range of sample sizes, the greatest power. The ELOD, as a function of $r$, is then defined as

$$
\mathrm{ELOD}(r) = \sum_{i=0}^{2} P[X=i \mid r] \, \ln\!\left(\frac{P[X=i \mid r]}{P[X=i \mid r=.5]}\right).
$$

For the power calculations, we need to specify completely the statistical test. The test for affected sib pairs is usually based on the trinomial distribution for $X$, but, for unilineal relationships, most of the essential information to compare hypotheses is contained in the dichotomy of X values {X=0, X>0} or {XO r=.S] I r=ri], eli = P[X>O r=rl]

This power can be examined numerically to determine appropriate sample sizes for analysis. Figure 7 shows the power for grandparent-grandchild pairs for three sample sizes ($s$ = 50, 100, and 150) and for two recombination fractions ($r_1 = .05$ and $r_1 = .20$) corresponding to the two reasonable extremes of the possible marker densities for a genomic linkage search. Since the important factor in determining the power of the test is the probability that the two affected individuals are i.b.d. at the trait locus ($\mu$), we graph the power against $\mu$ for a four-allele marker system with equally frequent alleles. On the same axis, we note the gene frequencies for both a partially penetrant dominant trait and a partially penetrant recessive trait that would imply that value of $\mu$. It should be noted that the scale is not linear in gene frequency. For a rare gene, either dominant or recessive, a sample size of 50 pairs is sufficient to map this trait with high probability for a recombination fraction of .05, although a recombination fraction of .2 implies a power of approximately .6. For the larger sample sizes and either recombination fraction, sufficient power to map the trait is obtained. For higher gene frequencies, the power decreases as $\mu$ decreases, although, for a dominant gene with frequency as high as .14, a tightly linked marker ($r = .05$) can still be mapped with $s$ = 150 pairs and greater than .6 power.

[Figure 7: power (vertical axis) plotted against $\mu$ (horizontal axis, annotated with the corresponding dominant and recessive gene frequencies), with one curve for each combination of $s$ and $r$.]

Figure 7 Power to map as a function of $s$, the number of grandparent-grandchild pairs, for recombination fractions $r$ = .05 and .20 for a four-allele marker system; all alleles are equally frequent. The significance level is .05.

We also computed the power of the uncle-nephew and half-sibs relationships (results not shown), which we already know are less powerful than the grandparent-grandchild relationship. The distinction in power between the grandparent-grandchild relationship and these others is most noticeable for higher recombination fractions; when $r = .20$ in circumstances where the grandparent-grandchild relationship has power less than .95, this relationship has two- to threefold higher power than uncle-nephew and half-sibs relationships, while for $r = .05$ the difference is limited to 30%. In the range of interest, the half-sibs relationship has approximately 10% more power than the uncle-nephew relationship. These differences do not change qualitatively with sample size or with the polymorphism of the marker.

Figure 8 shows the power curves for first cousins. The scale, again in terms of $\mu$, has been modified to give the same range of gene frequencies as given in figure 7. For higher values of $\mu$, the cousin pairs show a higher power than uncle-nephew pairs, but the order is reversed when the trait gene is more common. At the lower end of this range, the cousins have a reduced $\mu$, and, in fact, only on the order of one half the cousins will be i.b.d. In this situation, closer relatives are preferable, since they are more likely to share a copy of the trait gene.

Utility of Information on Other Relatives

While the incorporation of trait information on relatives of the affected pair is impossible without a complete likelihood analysis, information on the marker types of relatives can be used to modify i.b.d. probabilities at the marker locus. A complete, general discussion of this possibility is not feasible, so we concentrate on the extreme case, which is the case where i.b.d. at the marker locus can be determined unequivocally. This would be the case where the marker was highly polymorphic and where marker information on all relatives in the path connecting the two affected individuals was available. In this case, we would base our test on the number of pairs in the sample i.b.d. at the marker locus. The ELOD for this comparison is formally identical to the ELOD based on $X$ with an infinite number of alleles at the marker locus (fig. 3). This simple model produces the highest ELOD that can be obtained from pairs of affected relatives. By comparison, a marker that has a polymorphism corresponding to $n = 8$ has about one-half the ELOD available from the fully informative marker described above. The sampling of the intermediate relative plus the spouse of the affected grandparent can then, at most, provide twice this information. Of course, even for multiple alleles, there will be many cases in which the exact i.b.d. state cannot be determined, because of either the lack of polymorphism in the particular sample or the mating of two identical heterozygotes.

[Figure 8: power (vertical axis) plotted against $\mu$ (horizontal axis, annotated with the corresponding dominant and recessive gene frequencies), with one curve for each combination of $s$ and $r$.]

Figure 8 Power to map as a function of $s$, the number of cousin pairs, for recombination fractions $r$ = .05 and .20 for a four-allele marker system; all alleles are equally frequent. The significance level is .05.

Discussion

The affected-pair method is an appealing design for linkage analysis for a trait that is familial and perhaps even genetic, but for which no mode of inheritance is apparent when pedigrees are examined. If a mode of inheritance is suggested, then "classical" linkage analysis clearly is preferred. In this research, we have examined the utility of i.b.s. methods and have shown that several factors have a major influence on the power. These factors are the relationship of the affected pairs, the polymorphism of the marker, the recombination distance between a trait locus and the closest marker, and the mode of inheritance of the trait. The overall conclusion of these analyses is that, if a trait is homogeneous and sufficient pairs are available, then mapping is feasible. However, defining the appropriate sampling units for mapping is extremely confusing. The complex interaction of the above factors prohibits simplistic statements about efficiency when a linkage study is initiated.

Probably the clearest conclusion about mapping is that, for a partially penetrant recessive trait, affected sib pairs are the most informative, while for a trait with dominance, other relationships may be considerably more informative. For a dominant trait, the most efficient sampling unit does vary with the allele frequency of the trait and the marker polymorphism, so that for a common trait allele or markers low in polymorphism, sibs will still be the unit of choice. The examination of segregation patterns in pedigrees, before linkage studies, is therefore necessary to indicate whether the trait has evidence of dominant or recessive inheritance.

While the relationship that will provide the most linkage information depends on unknown genetic parameters, several other generalities can be stated. The first is that, of the three relationships grandparent-grandchild, uncle-nephew, and half-sibs, the grandparent-grandchild relationship is always the most powerful. Thus, given a choice among these relationships, grandparent-grandchild pairs should be selected. Unfortunately, for many traits, grandparent-grandchild pairs may be the least available of these relationships. Under many conditions, the uncle-nephew, half-sibs, and cousin pairs have similar levels of information on average. The exception is the tight linkage or candidate-gene linkage study of a rare trait allele, where distant relationships can be more powerful even though many of those pairs will not be i.b.d. at the trait locus. Thus, although we do not expect these affected pairs all to be identical at the marker locus, the distortion in the i.b.d. distribution will be the greatest for the more distant relationships.
However, this distortion depends on the allele frequency of the trait; for more common trait alleles, only close relatives should be included in the sample. As for all linkage studies, the polymorphism of the marker is important, with higher polymorphism implying greater information. Another factor that has considerable impact on the linkage information is the frequency of phenocopies; table 2 shows that even low phenocopy rates can have disproportionately large effects. While we have concentrated only on pairs of affected individuals, it is clear that this sampling unit will be the one impacted most by phenocopies. Subunits of families highly loaded for the trait are more likely to carry the trait allele and hence be more informative for linkage.

In general, the power of these methods decreases rapidly with the recombination distance. For a genomic search, this implies that markers should be spaced at approximately a recombination distance of .2, so that the trait is never greater than a .10 recombination distance from the closest marker. The variation in linkage information by recombination fraction is shown by examination of the sample sizes required for reasonable power. For instance, for first cousins and a trait determined by a rare dominant gene plus a linked marker with four equally frequent alleles at a distance of .05 recombination frequency, 50 pairs are sufficient to map with power of .8 or greater, while at a .20 recombination frequency 150 pairs give power of only .5.

We note that the sample sizes shown here are considerably smaller than those given by Risch (1990a). The major reason is that we have considered a significance level of .05, which, as pointed out by Risch (1990a), is less stringent than the standard 3.0 LOD score requirement. Our approach is that these analyses are by necessity exploratory, since there is limited evidence for a major gene in these families; if a significant result is obtained, then confirmation will be required.

The use of affected relatives for mapping represents the smallest sampling unit that can be informative for linkage. The calculations performed show that, unless the markers are highly polymorphic, there can only be minimal evidence for linkage; that is, unless i.b.d. can be inferred, if not with certainty then with high probability, no detection of linkage will be possible. This point has been stressed by Risch (1990a) and is the important insight required for deciding who among the other relatives it is important to phenotype (for the marker locus). Affected pairs for which i.b.s. can be replaced by absence or presence of i.b.d. at the marker locus are particularly informative. Thus, although typing unaffected individuals will not give any insight into the segregation of the trait susceptibility locus, it may nevertheless be necessary for informative linkage studies.
We have shown that the power to map depends on $\mu$. Risch (1990a) has pointed out that in practical situations $\mu$ can be estimated from the following relationship for unilineal relatives (which is rewritten in the notation of the present paper): $\mu = 4\phi\lambda/(4\phi\lambda + 1 - 4\phi)$, where $\lambda$ is the ratio of the risk to a relative of the proband i.b.d. for a single allele at the trait locus to the risk to a member of the general population. The $\lambda$ is estimated from the frequency of the trait in a parent (or an offspring) of a case, compared with the frequency of the trait in the general population. While we have concentrated on traits determined by a single locus, the above analysis is straightforward to complete for traits which involve multiple loci; the power simply depends on the deviations induced in the probability of i.b.d. at the marker locus by the trait locus or loci in the region of the marker locus.
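The relationship between $\mu$ and $\lambda$ quoted in the Discussion is simple enough to tabulate directly. The short Python sketch below does so; the function name `prob_ibd_at_trait` and the example risk ratio of 10 are hypothetical choices made for illustration and are not values reported in this paper.

```python
def prob_ibd_at_trait(lam, phi):
    """mu = 4*phi*lam / (4*phi*lam + 1 - 4*phi) for unilineal relatives.

    lam: risk ratio (relative i.b.d. for one trait allele vs. population).
    phi: coefficient of kinship, so that 4*phi is the prior i.b.d. probability.
    """
    if not 0 < phi <= 0.25:
        raise ValueError("phi must lie in (0, 1/4] for unilineal relatives")
    if lam <= 0:
        raise ValueError("lam must be positive")
    weighted = 4 * phi * lam
    return weighted / (weighted + 1 - 4 * phi)


if __name__ == "__main__":
    # Hypothetical risk ratio, used only to show the effect of kinship.
    for name, phi in [("grandparent-grandchild", 1 / 8), ("first cousins", 1 / 16)]:
        print(name, round(prob_ibd_at_trait(10.0, phi), 3))
```

When $\lambda = 1$ the function returns the prior value $4\phi$, which is a quick sanity check that sampling affected pairs carries no extra information about i.b.d. status when relatives have no excess risk.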

Acknowledgments

This research was supported partially by NIH awards CA36362 and CA-28854. The authors would like to express their gratitude to Neil Risch for his numerous helpful comments and discussions related to this research and for sharing his manuscripts, which presented research similar to that given here. The authors would also like to thank Sir Walter Bodmer, Cathy Falk, Stephanie Sherman, and Mark Skolnick for discussions pertinent to this research. Special thanks are also due to Michael Boehnke for detailed comments on an earlier version of the manuscript.

Appendix

Mathematical Preliminaries for the Calculation of Information Content and Power

In this Appendix we detail the mathematical formulation of the ELOD and the power calculations. Again, we assume we sample sets of pairs of affected related individuals of defined genetic relationship. For notation, we define the following events: $R$ = the genetic relationship of the two individuals; $\phi$ = the coefficient of kinship for relationship $R$; $A$ = the event that the two individuals are affected; and $G$ = the true mode of inheritance of the trait. We also define two random variables: $\mathrm{ibd}_T$ = the number of genes i.b.d. at the trait locus for the two related individuals; and $\mathrm{ibd}_M$ = the number of genes i.b.d. at the marker locus for the two related individuals. Thus both $\mathrm{ibd}_T$ and $\mathrm{ibd}_M$ may take the values 0, 1, and 2. With this notation, $k_i(r) = P[\mathrm{ibd}_M = i \mid R, A, G, r]$ for $i$ = 0, 1, and 2. For $r = .5$ and in the absence of any epistatic effect, as noted previously, the $k_i(r)$ are essentially the Cotterman k-coefficients.

1. I.B.D. at the Marker Locus

The marker i.b.d. distribution is obtained by conditioning on the i.b.d. status at the trait locus for the two related affected individuals. We now require $k_i(r) = P[\mathrm{ibd}_M = i \mid R, A, G, r]$, i.e., the probability that the two affected related individuals are i.b.d. for $i$ alleles at the marker locus when the mode of inheritance of the trait is given ($i$ = 0,1,2). For convenience of notation, we will not write $R$ and $G$ in the conditional probability. All probabilities are, however, still conditional on these events. Under these conditions,

$$
P[\mathrm{ibd}_M = i \mid A, r] = \sum_{j=0}^{2} P[\mathrm{ibd}_M = i \mid A, \mathrm{ibd}_T = j, r] \, P[\mathrm{ibd}_T = j \mid A].
$$

The probabilities $P[\mathrm{ibd}_T = j \mid A]$ ($j$ = 0,1,2) depend only on the mode of inheritance of the trait. These are obtained in the following sections. If we assume no epistasis between the marker and the trait, then the probabilities $P[\mathrm{ibd}_M = i \mid A, r]$ ($i$ = 0,1,2) are obtained from $P[\mathrm{ibd}_M = i \mid A, \mathrm{ibd}_T = j, r] = P[\mathrm{ibd}_M = i \mid \mathrm{ibd}_T = j, r] = d_{ij}(r)$. This latter term is simply the probability of $i$ genes i.b.d. at a second locus (the marker), conditional on $j$ genes i.b.d. at the first locus (the trait locus). Finally, we have, for $i$ = 0, 1, and 2,

$$
k_i(r) = P[\mathrm{ibd}_M = i \mid A, r] = \sum_{j} d_{ij}(r) \, P[\mathrm{ibd}_T = j \mid A].
$$

2. I.B.D. at the Trait Locus

The pairs of affected relatives will be i.b.d. at the trait locus with a probability that is determined by the true mode of inheritance of the trait. While this will be unknown in any practical situation, we can examine the range of values for this probability by considering several simple situations. We expect that traits with modes of inheritance for which the affected relatives are almost certainly i.b.d. at the trait locus will be more readily mapped, since we expect to see greater distortions in the marker sharing frequencies. The calculation of the i.b.d. probability is based on an application of Bayes's theorem. We require the probability that two affected related individuals be i.b.d. at the trait locus. Then for $j$ = 0, 1, 2,

$$
P[\mathrm{ibd}_T = j \mid A] = \frac{P[A \mid \mathrm{ibd}_T = j] \, P[\mathrm{ibd}_T = j]}{\sum_{i=0}^{2} P[A \mid \mathrm{ibd}_T = i] \, P[\mathrm{ibd}_T = i]}.
$$

Now, for unilineal relatives, $P[\mathrm{ibd}_T = j] = (1 - 4\phi)$ (for $j$ = 0), $4\phi$ (for $j$ = 1), and 0 (for $j$ = 2), while the probabilities of affection depend directly on the genetic model.

Model 1. Suppose $G$ is a dominant trait, determined by a single locus with two alleles ($B$, $b$); the penetrance of the genotypes $BB$ and $Bb$ is $t$, while the penetrance of the genotype $bb$ is 0. Suppose also that the two affected individuals have a unilineal relationship. Let $p$ be the frequency of $B$. Then, by arguments similar to those presented previously, $P[A \mid \mathrm{ibd}_T = 1] = p t^2 + (1-p) p^2 t^2$ and $P[A \mid \mathrm{ibd}_T = 0] = Q^2 t^2$, where $Q = [1 - (1-p)^2]$ is the probability that a random individual is susceptible. Finally,

$$
P[\mathrm{ibd}_T = 1 \mid A] = \frac{(p + p^2 - p^3) \, 4\phi}{(p + p^2 - p^3) \, 4\phi + Q^2 (1 - 4\phi)}.
$$
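The Bayes calculation for model 1 can be checked numerically. The Python sketch below keeps the penetrance $t$ as an explicit argument so that its cancellation can be seen; the function name `posterior_ibd_dominant` is an illustrative choice, and the example gene frequency of .138 is read off the dominant scale of figure 7 for grandparent-grandchild pairs.

```python
def posterior_ibd_dominant(p, phi, t=1.0):
    """P[ibd_T = 1 | A] for model 1 (dominant, penetrance t, unilineal pair).

    p: frequency of the susceptibility allele B.
    phi: coefficient of kinship (4*phi is the prior probability of i.b.d.).
    """
    q = 1 - p
    prior_ibd = 4 * phi
    # Shared allele is B (prob. p), or shared allele is b and both other alleles are B.
    affected_given_ibd = t ** 2 * (p + q * p ** 2)
    # With no allele i.b.d., each individual is independently susceptible with prob. Q.
    susceptible = 1 - q ** 2
    affected_given_not_ibd = (susceptible * t) ** 2
    numerator = affected_given_ibd * prior_ibd
    return numerator / (numerator + affected_given_not_ibd * (1 - prior_ibd))


if __name__ == "__main__":
    # Grandparent-grandchild pairs (phi = 1/8); value is close to .70.
    print(posterior_ibd_dominant(0.138, 1 / 8))
    # Changing the penetrance leaves the result unchanged up to rounding.
    print(posterior_ibd_dominant(0.138, 1 / 8, t=0.3))
```

Because both likelihoods carry the common factor $t^2$, any positive penetrance gives the same posterior, which is the numerical counterpart of sampling only affected individuals.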

264

Bishop and Williamson

We notice that this probability is independent of the penetrance, since we sample only affected individuals. If the genotype bb could also express the trait, then this probability would be a function of the ratio of the two penetrances (see model 3 below). In the absence of this, figure 1 shows the probability as a function of the allele frequency p. As expected, for small p, the individuals are, with high probability, i.b.d. Model 2.-Model 2 is a recessive model which is identical to model 1, except that the BB individuals are affected with probability t and that Bb and bb individuals are affected with probability 0. A similar calculation shows

P[ibdT

=

1 1 A]

44)

= 44)

+

(1-44))p

Again, figure 1 shows this probability as a function of the frequency of B.

Model 3.-Model 3 is the same as model 1 except that bb individuals express the trait with probability xt. The population frequency of the trait is $f = t(p^2 + 2pq + q^2 x)$, where p is the frequency of the allele B and q = 1 - p. Then

$$
P[\mathrm{ibd}_T = 1 \mid A] = \frac{P[A \mid \mathrm{ibd}_T = 1]\,4\phi}{P[A \mid \mathrm{ibd}_T = 1]\,4\phi + (1-4\phi)\,f^2},
$$

where $P[A \mid \mathrm{ibd}_T = 1] = t^2[p + q(p + qx)^2]$. When x = 0, this reduces to the formula for model 1. Again, as in models 1 and 2, this probability is independent of t.

3. I.B.D. for Two Linked Loci

The i.b.d. probabilities for two related individuals and two linked loci are obtained by considering all genotype paths connecting the two individuals (Campbell and Elston 1971; Lange 1974; Thompson 1986). These functions, $d_{ij}(r)$ (for i and j = 0, 1, and 2), for various common relationships are contained in table 1. For unilineal relationships, individuals may only be i.b.d. for a single allele at each locus, implying that $d_{2j}(r) = 0$ for j = 0, 1 and that $k_2(r) = 0$ for all r. The formulas for the more distant cousinships may be obtained by multiplying the $d_{11}(r)$ probability by a factor (1 - r) for each segregation below first cousins, so that, for instance, the formula for second cousins involves $(1-r)^2$. We note that while $d_{11}(.5)$ is equal to .5 for three sets of relatives (half-sibs, grandparent-grandchild, and uncle-nephew), the probabilities of i.b.d. at each of two linked loci for these relationships are distinct (Thompson 1986). Figure 2 shows the various forms of $d_{11}(r)$ as a function of r for various degrees of relationship from table 1.

For unilineal relationships, the conditional probabilities may be obtained by considering

$$
P[\mathrm{ibd}_M = i \mid \mathrm{ibd}_T = j, r] = \frac{P[\mathrm{ibd}_T = j \mid \mathrm{ibd}_M = i, r]\,P[\mathrm{ibd}_M = i]}{P[\mathrm{ibd}_T = j]}.
$$

By definition, $P[\mathrm{ibd}_M = 1 \mid \mathrm{ibd}_T = 1, r] = d_{11}(r)$. Taking i = 1 and j = 0 then gives

$$
\begin{aligned}
d_{10}(r) = P[\mathrm{ibd}_M = 1 \mid \mathrm{ibd}_T = 0, r]
&= \frac{(1 - P[\mathrm{ibd}_T = 1 \mid \mathrm{ibd}_M = 1, r])\,4\phi}{1-4\phi} \\
&= \frac{(1 - P[\mathrm{ibd}_M = 1 \mid \mathrm{ibd}_T = 1, r])\,4\phi}{1-4\phi} \\
&= \frac{[1 - d_{11}(r)]\,4\phi}{1-4\phi}.
\end{aligned}
$$

For these relationships we can thus compute all of the required conditional i.b.d. probabilities. Bilineal conditional probabilities can be similarly evaluated, although they require more direct calculation than what is presented here.

4. Unilineal Relationships

For unilineal relatives, an explicit formula for the $k_1(r)$ can be obtained. We define $\mu$ to be the probability of a gene i.b.d. at the trait locus for the two related affected individuals, i.e., $\mu = P[\mathrm{ibd}_T = 1 \mid A]$. Because unilineal relatives can only share, at most, one gene at each locus, the overall probability of a gene i.b.d. at the marker locus is $k_1(r)$, where

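$$
\begin{aligned}
k_1(r) = P[\mathrm{ibd}_M = 1 \mid A, r]
&= \sum_{j=0,1} P[\mathrm{ibd}_M = 1 \mid \mathrm{ibd}_T = j, r]\,P[\mathrm{ibd}_T = j \mid A] \\
&= \mu\,d_{11}(r) + (1-\mu)\,d_{10}(r) \\
&= \mu\,d_{11}(r) + \frac{(1-\mu)\,4\phi\,[1 - d_{11}(r)]}{1-4\phi} \\
&= \frac{4\phi(1-\mu) + d_{11}(r)(\mu - 4\phi)}{1-4\phi}.
\end{aligned}
\tag{4}
$$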
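As a numerical illustration of equation (4), the sketch below chains the model 3 expression for μ, the relation between $d_{10}(r)$ and $d_{11}(r)$, and the two forms of $k_1(r)$. The function names and the argument `four_phi` (standing for 4φ) are hypothetical names introduced here. The value of $d_{11}(r)$ is passed in directly rather than computed from r, because table 1 is not reproduced in this appendix.

```python
def mu_model3(four_phi, p, x):
    """P[ibdT = 1 | A] under model 3; the penetrance t cancels."""
    q = 1.0 - p
    a1 = p + q * (p + q * x) ** 2                 # P[A | ibdT = 1] / t^2
    f_sq = (p * p + 2 * p * q + q * q * x) ** 2   # f^2 / t^2
    return a1 * four_phi / (a1 * four_phi + (1.0 - four_phi) * f_sq)


def d10(d11, four_phi):
    """P[ibdM = 1 | ibdT = 0, r], expressed through d11(r)."""
    return four_phi * (1.0 - d11) / (1.0 - four_phi)


def k1(d11, mu, four_phi):
    """Equation (4), mixture form: mu * d11 + (1 - mu) * d10."""
    return mu * d11 + (1.0 - mu) * d10(d11, four_phi)


def k1_closed(d11, mu, four_phi):
    """Equation (4), final closed form."""
    return (four_phi * (1.0 - mu) + d11 * (mu - four_phi)) / (1.0 - four_phi)


if __name__ == "__main__":
    four_phi = 0.5                       # assumed relationship with 4*phi = .5
    mu = mu_model3(four_phi, p=0.05, x=0.1)
    for d11 in (0.5, 0.7, 0.9):          # illustrative values of d11(r)
        print(d11, k1(d11, mu, four_phi), k1_closed(d11, mu, four_phi))
```

The two forms should agree to rounding error, and setting `d11` equal to `four_phi` returns `four_phi` whatever the value of `mu`, which corresponds to the unlinked case r = .5.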

We notice that for $\mu = 1$, $P[\mathrm{ibd}_M = 1 \mid A, r] = d_{11}(r)$, as expected, and that for r = .5, $d_{11}(r) = 4\phi$ and, therefore, $k_1(.5) = 4\phi$, again as expected.

5. Combining Cells in the χ² Test

Often for unilineal relatives we will wish to combine two of the cells from the trichotomy {X=0, X=1, X=2} to produce a test based only on a dichotomy. Intuitively, the dichotomy {X=0, X>0} would seem the natural reduction, since it reflects the existence of allele sharing among the affected pair. However, the derivative of P[X=1] (in eq. [3]) with respect to r is equal to $(T_{11} - T_{10})\,dk_1(r)/dr$. Since the second factor is always negative (see eq. [4]), the probability is increasing as a function of r if $T_{10} > T_{11}$ and is decreasing if the reverse inequality is true. Algebraically, formula (2) shows that, for n = 2, $T_{10}$ is always at least as big as $T_{11}$, while for n > 2 the direction of inequality varies with the allele frequencies. When the alleles are equally frequent (and n > 2), $T_{11} > T_{10}$; when there is wide discrepancy in the allele frequencies, the reverse may be true. The importance of this result is that, since P[X=2] is decreasing as a function of r and since P[X=0] is always increasing, the {X=1} group should be paired with the group which is expected to be deviant in the same direction. Failure to ensure this will cause the deviations within the combined cell to offset one another and reduce their magnitude.
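The pairing rule can be sketched by direct enumeration. Formula (2) is not reproduced in this appendix, so the code below rests on an assumption: $T_{1j}$ is read as the probability that a unilineal pair shares exactly one marker allele identical by state (X = 1) given that they share j genes i.b.d. at the marker, under random mating. The function names and the string labels returned by `pair_x1_with` are hypothetical.

```python
from itertools import product


def ibs(g1, g2):
    """Number of alleles (0, 1, or 2) two unordered genotypes share by state."""
    (a, b), (c, d) = g1, g2
    return max((a == c) + (b == d), (a == d) + (b == c))


def prob_x1(freqs, ibd):
    """P[X = 1 | ibdM = ibd] for a unilineal pair (ibd is 0 or 1)."""
    n = len(freqs)
    total = 0.0
    if ibd == 1:
        # One shared allele s, plus one independent allele per relative.
        for s, u, v in product(range(n), repeat=3):
            if ibs((s, u), (s, v)) == 1:
                total += freqs[s] * freqs[u] * freqs[v]
    else:
        # Four independently drawn alleles.
        for a, b, c, d in product(range(n), repeat=4):
            if ibs((a, b), (c, d)) == 1:
                total += freqs[a] * freqs[b] * freqs[c] * freqs[d]
    return total


def pair_x1_with(freqs):
    """Cell to merge with {X=1}: the one whose probability moves the same way in r."""
    t11, t10 = prob_x1(freqs, 1), prob_x1(freqs, 0)
    # P[X=1] increases with r (like P[X=0]) when T10 > T11; otherwise it
    # is non-increasing (like P[X=2]). A tie means P[X=1] is flat in r.
    return "X=0" if t10 > t11 else "X=2"


if __name__ == "__main__":
    for freqs in ([0.5, 0.5], [0.9, 0.1], [0.25] * 4):
        print(freqs, prob_x1(freqs, 1), prob_x1(freqs, 0), pair_x1_with(freqs))
```

Under this reading, two equally frequent alleles give $T_{10} = T_{11}$, a skewed two-allele marker gives $T_{10} > T_{11}$, and four equally frequent alleles give $T_{11} > T_{10}$, in line with the statements in the text.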

References

Blackwelder WC, Elston RC (1985) A comparison of sib-pair linkage tests for disease susceptibility loci. Genet Epidemiol 2:85-97
Botstein DR, White RL, Skolnick MH, Davis RW (1980) Construction of a genetic map in humans using restriction fragment length polymorphisms. Am J Hum Genet 32:314-331
Campbell MA, Elston RC (1971) Relatives of probands: models for preliminary genetic analysis. Ann Hum Genet 35:225-236
Cantor RM, Rotter JI (1987) Marker concordance in pairs of distant relatives: a new method of linkage analysis for common diseases. Am J Hum Genet 41:A252
Cotterman CW (1940) A calculus for statistico-genetics. PhD thesis, Ohio State University, Columbus. Published in P. Ballonoff (ed) (1975) Genetics and social structure. Benchmark papers in genetics. Academic Press, New York
Day NE, Simons MJ (1976) Disease susceptibility genes - their identification by multiple case family studies. Tissue Antigens 8:109-117
Haseman JK, Elston RC (1972) The investigation of linkage between a quantitative trait and a marker locus. Behav Genet 2:3-19
Hodge SE (1981) Some epistatic two-locus models of disease. I. Relative risks and identity by descent distributions in affected sib pairs. Am J Hum Genet 33:381-395
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22:79-86
Lange K (1974) Relative to relative transition probabilities for two linked genes. Theoret Pop Biol 6:92-107
Lange K (1986a) The affected sib-pair method using identity by state relations. Am J Hum Genet 39:148-150
Lange K (1986b) A test statistic for the affected-sib-set method. Ann Hum Genet 50:283-290
Penrose LS (1953) The general purpose sib-pair linkage test. Ann Eugenics 6:133-138
Risch N (1990a) Linkage strategies for genetically complex traits. I. Multilocus models. Am J Hum Genet 46:222-228
Risch N (1989b) Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Am J Hum Genet 46:229-241
Risch N (1989c) Linkage strategies for genetically complex traits. III. The effect of marker polymorphism on analysis of affected relative pairs. Am J Hum Genet 46:242-253
Smith CAB (1953) The detection of linkage in human genetics. J R Stat Soc [B] 15:153-192
Suarez BK, Rice J, Reich T (1978) The generalized sib pair IBD distribution: its use in the detection of linkage. Ann Hum Genet 42:87-94
Thompson EA (1986) Pedigree analysis in human genetics. Johns Hopkins University Press, Baltimore
Thomson G (1986) Determining the mode of inheritance of RFLP-associated diseases using the affected sib-pair method. Am J Hum Genet 39:207-221
Weeks DE, Lange K (1988) The affected-pedigree-member method of linkage analysis. Am J Hum Genet 42:315-326