Letters to the Editor

Am. J. Hum. Genet. 50:1139-1140, 1992

Estimating Linkage Disequilibrium from Conditional Data

To the Editor:

When the strength of association between marker and disease loci is sought, it is often the case that samples are taken at random from within one or more disease genotype classes. The now-classic case is that of Kan and Dozy (1980), who demonstrated association of RFLPs to sickle cell anemia by sampling from within normal, carrier, and affected individuals. Estimation of population linkage disequilibrium from such conditional data sets was discussed by Chakravarti et al. (1984), and this note offers a more efficient means of estimation. It provides some justification for the approach suggested by Weir (1990). In brief, our method makes use of all the available data in a single calculation, and we show that there can be a substantial lowering of the variance of the resulting estimates in cases of unbalanced data. We consider the case of independent samples from each of three disease genotypes (table 1).

Table 1

Two-Locus Genotype Counts

                              MARKER GENOTYPES
GIVEN DISEASE GENOTYPE    M1M1     M1M2     M2M2     TOTAL
DD ................       n11      n12      n13      n1
Dd ................       n21      n22      n23      n2
dd ................       n31      n32      n33      n3

SOURCES.-Kan and Dozy (1980); Chakravarti et al. (1984).

Haplotype frequencies allow for linkage disequilibrium ($D$) in the population and make use of the two known disease allele frequencies $p = f(D)$ and $q = f(d)$ and the two unknown marker allele frequencies $m_1 = f(M_1)$ and $m_2 = f(M_2)$:

$$f(DM_1) = pm_1 + D, \quad f(DM_2) = pm_2 - D; \qquad f(dM_1) = qm_1 - D, \quad f(dM_2) = qm_2 + D.$$

The frequencies of marker allele $M_1$ conditional on the disease genotype are

$$a_1 = f(M_1 \mid DD) = m_1 + D/p; \quad a_3 = f(M_1 \mid dd) = m_1 - D/q; \quad a_2 = f(M_1 \mid Dd) = (a_1 + a_3)/2.$$

We ignore the possible influence of selection and assume random mating. The nine expected genotypic frequencies can be described in terms of two unknown parameters, either $m_1, D$ or $a_1, a_3$, since we assume $p$ and $q$ are known. The likelihoods for the three samples are

$$L(a_1, a_3)_{DD} \propto a_1^{x_1}(1 - a_1)^{2n_1 - x_1},$$

$$L(a_1, a_3)_{Dd} \propto (a_1 a_3)^{n_{21}}(a_1 + a_3 - 2a_1 a_3)^{n_{22}}(1 - a_1 - a_3 + a_1 a_3)^{n_{23}},$$

$$L(a_1, a_3)_{dd} \propto a_3^{x_3}(1 - a_3)^{2n_3 - x_3},$$

where $x_1 = 2n_{11} + n_{12}$ and $x_3 = 2n_{31} + n_{32}$. We differ from Chakravarti et al. (1984) by estimating the parameters from all three samples considered simultaneously. Since the samples are independent, the joint likelihood is just the product

$$L(a_1, a_3) \propto L(a_1, a_3)_{DD}\, L(a_1, a_3)_{Dd}\, L(a_1, a_3)_{dd}.$$

Numerical methods are needed to find the maximum-likelihood estimates of the $a$'s, but the likelihood approach also provides variances of the estimates. By inverting the information matrix for this system, we find

$$\mathrm{Var}(\hat a_1) = \delta_1 \phi_1 \phi_{13}/\gamma, \quad \mathrm{Var}(\hat a_3) = \delta_3 \phi_3 \phi_{13}/\gamma, \quad \mathrm{Cov}(\hat a_1, \hat a_3) = -n_2 \phi_1 \phi_3 \phi_{13}/\gamma,$$

where

$$\phi_1 = a_1(1 - a_1), \quad \phi_3 = a_3(1 - a_3), \quad \phi_{13} = a_1 + a_3 - 2a_1 a_3;$$

$$\delta_1 = \phi_{13}(2n_3 + n_2) - n_2 \phi_1, \quad \delta_3 = \phi_{13}(2n_1 + n_2) - n_2 \phi_3, \quad \gamma = \delta_1 \delta_3 - n_2^2 \phi_1 \phi_3.$$

For the parameters $m_1$, $D$ of primary interest,

$$\mathrm{Var}(\hat m_1) = p^2 \mathrm{Var}(\hat a_1) + q^2 \mathrm{Var}(\hat a_3) + 2pq\, \mathrm{Cov}(\hat a_1, \hat a_3);$$

$$\mathrm{Var}(\hat D) = p^2 q^2 [\mathrm{Var}(\hat a_1) + \mathrm{Var}(\hat a_3) - 2\, \mathrm{Cov}(\hat a_1, \hat a_3)];$$

while for the Yule's coefficient of association $\psi$, defined (Nei and Li 1980) to be a function only of conditional marker allele frequencies,

$$\hat\psi = \frac{\hat a_1 - \hat a_3}{\hat a_1 + \hat a_3 - 2\hat a_1 \hat a_3},$$

we find

$$\mathrm{Var}(\hat\psi) = \frac{4\phi_1 \phi_3 (\phi_3 \delta_1 + \phi_1 \delta_3 + 2n_2 \phi_1 \phi_3)}{\phi_{13}^3\, \gamma}.$$
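The letter states that numerical methods are needed to maximize the joint likelihood but does not say which. One possible approach, sketched below in Python, is Fisher scoring on the two conditional frequencies with step halving. This is an illustrative choice and not necessarily the procedure the authors used. The function names `log_lik` and `fit_joint` and the genotype counts are hypothetical. The score and expected-information entries in the code come from differentiating the logarithm of the joint likelihood given above.

```python
import math

def log_lik(a1, a3, counts):
    """Joint log-likelihood (up to a constant). Rows of `counts` are the
    DD, Dd, dd samples; columns are M1M1, M1M2, M2M2."""
    (n11, n12, n13), (n21, n22, n23), (n31, n32, n33) = counts
    x1, x3 = 2 * n11 + n12, 2 * n31 + n32      # M1 allele counts in DD, dd
    y1, y3 = 2 * n13 + n12, 2 * n33 + n32      # M2 allele counts in DD, dd
    return ((x1 + n21) * math.log(a1) + (y1 + n23) * math.log(1 - a1)
            + (x3 + n21) * math.log(a3) + (y3 + n23) * math.log(1 - a3)
            + n22 * math.log(a1 + a3 - 2 * a1 * a3))

def fit_joint(counts, tol=1e-10, max_iter=200):
    """Fisher scoring for (a1, a3); an assumed, not source-specified, method."""
    (n11, n12, n13), (n21, n22, n23), (n31, n32, n33) = counts
    n1, n2, n3 = n11 + n12 + n13, n21 + n22 + n23, n31 + n32 + n33
    x1, x3 = 2 * n11 + n12, 2 * n31 + n32

    def clamp(a):
        # Keep frequencies strictly inside (0, 1) so the logs stay finite.
        return min(max(a, 1e-6), 1 - 1e-6)

    # Start from the allele frequencies in the DD and dd samples alone.
    a1, a3 = clamp(x1 / (2 * n1)), clamp(x3 / (2 * n3))
    for _ in range(max_iter):
        f1, f3, f13 = a1 * (1 - a1), a3 * (1 - a3), a1 + a3 - 2 * a1 * a3
        # Score vector (first derivatives of the log-likelihood).
        u1 = ((x1 + n21) / a1 - (2 * n1 - x1 + n23) / (1 - a1)
              + n22 * (1 - 2 * a3) / f13)
        u3 = ((x3 + n21) / a3 - (2 * n3 - x3 + n23) / (1 - a3)
              + n22 * (1 - 2 * a1) / f13)
        # Expected information matrix.
        i11 = (2 * n1 + n2 * f13) / f1 + n2 * (1 - 2 * a3) ** 2 / f13
        i33 = (2 * n3 + n2 * f13) / f3 + n2 * (1 - 2 * a1) ** 2 / f13
        i13 = n2 / f13
        det = i11 * i33 - i13 ** 2
        s1 = (i33 * u1 - i13 * u3) / det       # scoring step = I^{-1} U
        s3 = (i11 * u3 - i13 * u1) / det
        step, current = 1.0, log_lik(a1, a3, counts)
        b1, b3 = clamp(a1 + s1), clamp(a3 + s3)
        while log_lik(b1, b3, counts) < current and step > 1e-8:
            step /= 2                          # halve until no decrease
            b1, b3 = clamp(a1 + step * s1), clamp(a3 + step * s3)
        moved = abs(b1 - a1) + abs(b3 - a3)
        a1, a3 = b1, b3
        if moved < tol:
            break
    return a1, a3

# Hypothetical counts (rows DD, Dd, dd; columns M1M1, M1M2, M2M2).
counts = [(1, 8, 41), (6, 60, 34), (25, 20, 5)]
a1_hat, a3_hat = fit_joint(counts)
p, q = 0.95, 0.05                      # disease allele frequencies, taken as known
m1_hat = p * a1_hat + q * a3_hat       # follows from a1 = m1 + D/p, a3 = m1 - D/q
D_hat = p * q * (a1_hat - a3_hat)
print(a1_hat, a3_hat, m1_hat, D_hat)
```

The last lines use the relations $m_1 = pa_1 + qa_3$ and $D = pq(a_1 - a_3)$, which follow directly from the definitions of $a_1$ and $a_3$. Any general-purpose optimizer could replace the scoring loop.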

We performed a series of numerical comparisons of the variances of $\hat D$ as estimated by our procedure and as estimated by the method of Chakravarti et al. (1984), which considers each of the three samples separately. We found the following trends in our results: our method becomes more efficient (i.e., has a smaller variance) as (a) $n_2$ increases; (b) $\min(n_1, n_3)$ decreases; (c) $n_2$ increases as a percentage of $n_1$, $n_3$; and (d) $|a_3 - a_1|$ increases. The variances of $\hat D$ are equal under the two approaches when both of $n_1 = n_3$ and $a_1 = a_3$ hold. In all cases, our method has the lower expected variance. We also considered estimation of $m_1$, and we found the following additional situations where the efficiency of our estimate increases: (a) $|n_3 - n_1|$ increases, and (b) $q$ decreases.

As an illustration, we considered the data of Kan and Dozy (1980). These authors had sample sizes $n_1 = 43$, $n_2 = 86$, and $n_3 = 58$, and Chakravarti et al. (1984) took $p = .95$ and $q = .05$. Our method provides $\hat a_1 = .0341$ and $\hat a_3 = .6789$, so that $\hat m_1 = .0664$, $\hat D = .0306$, and $\hat\psi = -.9671$. Using the estimated $a$'s, we would estimate the SDs for these three estimates to be .0144, .0018, and .0160, respectively. The corresponding estimates from the method of Chakravarti et al. (1984) are $\hat a_1 = .0465$ and $\hat a_3 = .7241$, so that $\hat m_1 = .0741$, $\hat D = .0322$, and $\hat\psi = -.9635$, with estimated SDs of .0207, .0022, and .0198, respectively. The estimates do not differ substantially, but the variances can differ by as much as a factor of 2, as happens for $\psi$ when $n_1 = 25$, $n_2 = 100$, and $n_3 = 75$ and when $a_1 = .1$ and $a_3 = .7$ for $q = .05$. For $D$, the ratio of variances drops to .45 when $n_1 = n_3 = 25$, $n_2 = 100$ and when $a_1 = .1$ and $a_3 = .7$ for $q = .05$. Less unbalance gives less discrepancy between the predicted variances of the two sets of estimates.

We have shown that, when data are taken from several independent samples, it is better to base estimates of population genetic parameters on the whole data set instead of combining estimates from each of the separate samples. Basing estimates on independent samples of disease genotypes leads to estimates of linkage disequilibrium, even in the absence of information on phase. We suggest that the increased computational burden for our method is offset by the greater efficiency, with the two methods becoming more different as the data become more unbalanced. As soon as data are entered into a computer file, however, there seems to be no need to choose analyses on the basis of computational simplicity.
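The numerical illustration in this letter can be checked with a short Python sketch. It evaluates the variances obtained by inverting the information matrix of the joint likelihood, then applies the delta method for $m_1 = pa_1 + qa_3$, $D = pq(a_1 - a_3)$, and Yule's coefficient. The function names are hypothetical. The letter does not spell out the variances of the separate-sample method, so `separate_cov` rests on an assumption: that $a_1$ and $a_3$ are estimated as binomial allele frequencies from the DD and dd samples alone, with zero covariance. Under that assumption only the $D$ and Yule's coefficient ratios are compared, since the separate-sample estimate of $m_1$ may be built differently.

```python
import math

def joint_cov(a1, a3, n1, n2, n3):
    """Var(a1_hat), Var(a3_hat), Cov(a1_hat, a3_hat) for the joint
    likelihood, from the inverse of the expected information matrix."""
    f1, f3, f13 = a1 * (1 - a1), a3 * (1 - a3), a1 + a3 - 2 * a1 * a3
    d1 = f13 * (2 * n3 + n2) - n2 * f1
    d3 = f13 * (2 * n1 + n2) - n2 * f3
    g = d1 * d3 - n2 ** 2 * f1 * f3
    return d1 * f1 * f13 / g, d3 * f3 * f13 / g, -n2 * f1 * f3 * f13 / g

def separate_cov(a1, a3, n1, n3):
    """ASSUMPTION (not stated in the letter): binomial variances from the
    DD and dd samples alone, with zero covariance."""
    return a1 * (1 - a1) / (2 * n1), a3 * (1 - a3) / (2 * n3), 0.0

def derived_vars(a1, a3, p, cov):
    """Delta-method variances of m1, D and Yule's coefficient."""
    v1, v3, c = cov
    q = 1 - p
    f1, f3, f13 = a1 * (1 - a1), a3 * (1 - a3), a1 + a3 - 2 * a1 * a3
    var_m1 = p * p * v1 + q * q * v3 + 2 * p * q * c
    var_D = (p * q) ** 2 * (v1 + v3 - 2 * c)
    # Partial derivatives of the coefficient: 2*f3/f13**2 and -2*f1/f13**2.
    var_psi = 4 * (f3 ** 2 * v1 + f1 ** 2 * v3 - 2 * f1 * f3 * c) / f13 ** 4
    return var_m1, var_D, var_psi

# Kan and Dozy (1980) sample sizes and the joint estimates quoted in the letter.
kd = derived_vars(0.0341, 0.6789, 0.95, joint_cov(0.0341, 0.6789, 43, 86, 58))
sds = [math.sqrt(v) for v in kd]       # SDs of m1, D, Yule's coefficient

def ratio(idx, n1, n2, n3, a1=0.1, a3=0.7, p=0.95):
    """Joint variance divided by assumed separate-sample variance
    (idx 1 = D, idx 2 = Yule's coefficient)."""
    joint = derived_vars(a1, a3, p, joint_cov(a1, a3, n1, n2, n3))[idx]
    sep = derived_vars(a1, a3, p, separate_cov(a1, a3, n1, n3))[idx]
    return joint / sep

ratio_D = ratio(1, 25, 100, 25)        # unbalanced design quoted for D
ratio_psi = ratio(2, 25, 100, 75)      # unbalanced design quoted for Yule's coefficient
print([round(s, 4) for s in sds], round(ratio_D, 2), round(ratio_psi, 2))
```

With the quoted inputs, the three SDs should round to the reported .0144, .0018, and .0160. Under the stated assumption about the separate-sample variances, the two ratios should come out near the reported .45 and the factor of 2.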

Acknowledgments

This work was supported in part by NIH grants GM32518 and GM45344.

P. J. MAISTE AND B. S. WEIR
Program in Statistical Genetics
Department of Statistics
North Carolina State University
Raleigh

References

Chakravarti A, Li CC, Buetow KH (1984) Estimation of the marker gene frequency and linkage disequilibrium from conditional marker data. Am J Hum Genet 36:177-186
Kan YW, Dozy AM (1980) Evolution of the hemoglobin S and C genes in world populations. Science 209:388-391
Nei M, Li W-H (1980) Non-random association between electromorphs and inversion chromosomes in finite populations. Genet Res 35:65-83
Weir BS (1990) Genetic data analysis. Sinauer, Sunderland, MA

© 1992 by The American Society of Human Genetics. All rights reserved. 0002-9297/92/5005-0031 $02.00

Am. J. Hum. Genet. 50:1140-1142, 1992

A Frameshift Mutation (2869insG) in the Second Transmembrane Domain of the CFTR Gene: Identification, Regional Distribution, and Clinical Presentation

To the Editor:

The ΔF508 mutation accounts for approximately 55% of Spanish cystic fibrosis (CF) chromosomes (Estivill et al. 1989). Three other mutations are relatively common in the Spanish population: G542X (Kerem et al. 1990), R1162X (Gasparini et al. 1991), and N1303K (Osborne et al. 1991), which have frequen-
