FISSION MODELS OF POPULATION VARIABILITY

E. A. THOMPSON

King’s College, Cambridge, CB2 1ST, England

Manuscript received March 26, 1979

ABSTRACT

Most models in population genetics are models of allele frequency, making implicit or explicit assumptions of equilibrium or constant population size. In recent papers, we have attempted to develop more appropriate models for the analysis of rare variant data in South American Indian tribes; these are branching process models for the total number of replicates of a variant allele. The spatial distribution of a variant may convey information about its history and characteristics, and this paper extends previous models to take this factor into consideration. A model of fission into subdivisions is superimposed on the previous branching process, and variation between subdivisions is considered. The case where fission is nonrandom and the locations of like alleles are initially positively associated, as would happen were a tribal cluster or village to split on familial lines, is also analyzed. The statistics developed are applied to Yanomama Indian data on rare genetic variants. Due to insufficient time depth, no definitive new inferences can be drawn, but the analysis shows that this model provides results consistent with previous conclusions, and demonstrates the general type of question that may be answered by the approach taken here. In particular, striking confirmation of a higher-than-average growth rate, and hence smaller-than-previously-estimated age, is obtained for the Yan2 serum albumen variant.

* Research done while visiting the Department of Human Genetics, University of Michigan, supported by National Science Foundation grant RMS-74-11823.

Genetics 93: 479-495 October, 1979

Most models for the analysis of genetic variability in populations make implicit or explicit assumptions of equilibrium or constant population size. While these assumptions may provide an adequate first approximation for many populations, they probably will not do so for the American Indian tribal populations of South America. These tribes, and villages within tribes, have arisen as the result of a fission and fusion process; villages come into being and disappear, and, on a larger time scale, presumably so also do tribes (NEEL and WEISS 1975). Tribes may grow rapidly from a single village, or may undergo disasters that decimate them. The genetic variability we observe is the result of complex demographic processes, and chance social and demographic factors may have a profound influence on the genetic constitution of a whole tribe. Of particular interest are the “private” variant alleles of the South American Indians: variants each present in only a single tribe, some of which have achieved polymorphic frequencies in that tribe (NEEL 1978). For such variants in such populations, a branching process model for the number of variant copies seems more appropriate than a diffusion process or migration model of allele frequency variability. Branching process models make other assumptions, of course, but questions of population size and equilibrium at least are avoided (THOMPSON 1976). It is therefore of interest to develop these models and to investigate whether they provide alternative interpretations of observed data.

THOMPSON (1976) discussed the estimation of the age of a variant allele on the basis of a branching process model with geometric offspring distribution, and suggested (THOMPSON 1977) that the spatial distribution of such an allele might provide further information on its age and rate of increase. KEIDING and NIELSEN (1975) have shown that the geometric offspring distribution allows explicit $t$th-generation distributions to be obtained, even in the case where the reproductive parameters vary over time, and this has been applied in an analysis of “founder effect” in the private polymorphisms of South American Indians (THOMPSON and NEEL 1978; NEEL and THOMPSON 1978). Here we return to an analysis of spatial variability under a branching process model of allelic replication and a fission model of population structure.

Our primary interest is in a rare variant, which may be assumed to have arisen as a single mutation and which is present in sufficiently low frequency for a branching process model of independent variant replications to apply. However, our inferences regarding such variants are dependent upon a demographic history, the details of which are unknown. It is therefore of interest to consider whether other genetic data can provide information about this history. In addition to the private polymorphisms, there are several other red cell polymorphisms whose tribal distributions have been studied (SMOUSE and NEEL 1977) and in which one allele has a sufficiently low frequency for a model of independent replications to apply. We shall of course not consider estimating the ages of these alleles; they predate the American Indian. However, their tribal spatial variability provides information about demographic parameters, which may then be useful in an analysis of the rare variant of primary interest.

AN EVOLUTIONARY TREE MODEL FOR TRIBAL FISSION

We shall consider a variant allele that arose as a single mutation $t_0$ generations ago, and thereafter replicated according to a branching process. At this juncture, we make no further assumptions about this process, but simply denote the probability generating function of the number of replicates of a single original gene after $t$ generations by $g_t(z)$. The generating function of the current total number of replicates of the variant allele is thus $g_{t_0}(z)$. As in previous papers, our model is one for the replication of this allele alone; the total population size and the numbers of other alleles do not enter into the treatment. We shall also suppose that the population has some infrastructure, as the result of fissioning at certain points in history (Figure 1); the tree structure and the times of split are assumed known. In practice, populations do not usually split suddenly, but become gradually differentiated, with some exchange of migrants continuing to the present day. In our case, however, we shall see that the infrastructure of the population is well represented by a tree model of instantaneous fission.

FIGURE 1. Evolutionary tree, showing gene numbers at times of fission. For details, see text.

Suppose that over the history of the allele there have been $k$ fissions in populations descended from the one carrying the original mutation, the $i$th occurring $t_i$ generations ago ($1 \le i \le k$; $t_k < t_{k-1} < \cdots < t_1 < t_0$). Suppose that the $i$th split occurred in a subpopulation then carrying $n_i$ copies of the variant, and that it split into $r_i$ components carrying $m_{ij}$ variant copies ($1 \le j \le r_i$, $1 \le i \le k$). Again, it is not necessary to define the splitting process precisely at this stage; $h_i$ will denote the probability generating function for the number of variant copies in each subdivision of the $i$th split, immediately subsequent to that split, conditional upon $n_i$;

$$h_i(z_1, \ldots, z_{r_i} \mid n_i) = E\Big(\prod_{j=1}^{r_i} z_j^{m_{ij}} \,\Big|\, n_i\Big). \tag{1}$$

We shall denote the $r_i$ subtrees with their roots at the $i$th split by $S_{ij}$ ($1 \le j \le r_i$), and the probability generating function for the numbers of current replicates descended from a single replicate at the origin of the subtree $S_{ij}$ by $g^*(\cdot \mid S_{ij})$. The total tree will be denoted by $S_0$. Then, we may derive recursively the generating function $g^*(\cdot \mid S_0)$ for the currently observable numbers of replicates $N_l$ ($1 \le l \le L$) in all the $L$ current subpopulations. Since subtrees evolve independently, subsequent to fission, repeated conditioning gives

$$g^*(\cdot \mid S_0) = E\big[h_1\big(g^*(\cdot \mid S_{11}), \ldots, g^*(\cdot \mid S_{1 r_1}) \mid n_1\big)\big], \tag{2}$$

where for convenience of notation we suppress the explicit variables $z_l$ of the generating functions $g^*$.


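In practice populations do not split randomly, but rather along familial or geographic lines (NEEL 1967; FIX 1975), and a model to incorporate this effect will be derived in the following section. However, in the simplest case, let us assume independent assortment of the variant alleles at the $i$th population fission. If each variant allele has probability $q_{ij}$ of going to the $j$th subpopulation ($1 \le j \le r_i$), then $h_i$ is the multinomial generating function

$$h_i(z_1, \ldots, z_{r_i} \mid n_i) = \Big(\sum_{j=1}^{r_i} q_{ij} z_j\Big)^{n_i}. \tag{3}$$

In this case (2) may be rewritten

$$g^*(\cdot \mid S_0) = g_{t_0 - t_1}\Big(\sum_{j=1}^{r_1} q_{1j}\, g^*(\cdot \mid S_{1j})\Big). \tag{4}$$

Similarly, assuming that the population of subtree $S_{1J}$ is the next to fission, we have

$$g^*(\cdot \mid S_{1J}) = g_{t_1 - t_2}\Big(\sum_{j=1}^{r_2} q_{2j}\, g^*(\cdot \mid S_{2j})\Big), \tag{5}$$

and so on down the tree. For any “tree” that is a single population and does not fission subsequent to its formation at the $i$th split, we simply have

$$g^*(z \mid S_{ij}) = g_{t_i}(z). \tag{6}$$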

Thus equations (4) to (6) enable a general expression for the generating function, $g^*(\cdot \mid S_0)$, of the current observations conditioned upon a known tree structure to be derived. A recursive formula for the probability generating function implies similar formulae for the moments of the distribution. Differentiating (4) and setting $z = 1$ (i.e., all components $z_l = 1$), we have for a subpopulation $l$ in $S_{1j}$

$$E(N_l) = e_{01}\, q_{1j}\, e_{1j}(l), \tag{7}$$

where $e_{01} = E(n_1) = g'_{t_0 - t_1}(1)$ is the expected number of replicates at the time of the first split, and $e_{1j}(l)$ is the expected number of descendants in subpopulation $l$ of a single variant at the root of the tree $S_{1j}$. Similarly, if $l$ is a subpopulation in $S_{2j'}$ (and hence, in the notation of (5), of $S_{1J}$), then $e_{1J}(l) = e_{12}\, q_{2j'}\, e_{2j'}(l)$, where $e_{12} = g'_{t_1 - t_2}(1)$. The recursive formulae for the variances and covariances are given by the second derivatives of the generating function:

$$\mathrm{var}(N_l) = V_{1j}(l)\, q_{1j}\, e_{01} + [e_{1j}(l)]^2\, [q_{1j}^2\, v_{01} + q_{1j}(1 - q_{1j})\, e_{01}], \tag{8}$$

where $V_{1j}(l)$ and $v_{01}$ are the variances corresponding to expectations $e_{1j}(l)$ and $e_{01}$ respectively. For two subpopulations $l$ in $S_{1j}$ and $l'$ in $S_{1j'}$,

$$\mathrm{cov}(N_l, N_{l'}) = \begin{cases} e_{1j}(l)\, e_{1j'}(l')\, \mathrm{cov}(m_{1j}, m_{1j'}) = -e_{1j}(l)\, e_{1j'}(l')\, q_{1j}\, q_{1j'}\, (e_{01} - v_{01}) & \text{if } j \ne j', \\ C_{1j}(l,l')\, E(m_{1j}) + e_{1j}(l)\, e_{1j}(l')\, \mathrm{var}(m_{1j}) = e_{01}\, q_{1j}\, C_{1j}(l,l') + e_{1j}(l)\, e_{1j}(l')\, [q_{1j}^2\, v_{01} + q_{1j}(1 - q_{1j})\, e_{01}] & \text{if } j = j', \end{cases} \tag{9}$$

where $C_{1j}(l,l')$ is the covariance contributed by a single variant gene at the root of $S_{1j}$. Thus, all second moments may be expressed in terms of contributions from genes present at the first split, and we may again work recursively down the tree.

FISSION INTO CLUSTERS

The application of a tree model to an analysis of rare-variant replicate numbers requires a depth of knowledge of population history that will rarely be available to us. We may, however, have several hierarchic clusters of populations, and a special case of the tree model arises when a tribe fissions simultaneously into several subpopulations, which thereafter evolve independently. In this section, we shall consider a single such split, and derive results conditional upon $n$, the number of variant copies at this time of fission. Since we make no assumptions about the history of the variant prior to the split, the results of this section are applicable to any allele that, at the split and subsequently, has been at sufficiently low frequency for a model of independent replications to apply. We shall assume that the fission occurred $t$ generations ago, and to avoid cumbersome terminology shall refer to a single gene existing in any subpopulation immediately subsequent to this split as an original gene. The generating function of the current number of replicates of any original gene is thus $g_t$. Conditional upon $n$ and upon independent assortment of the variant genes, we have now simpler forms for (7), (8) and (9):

$$E(N_l) = n\, q_l\, e, \tag{7'}$$

$$\mathrm{var}(N_l) = n\, v\, q_l + n\, q_l (1 - q_l)\, e^2, \tag{8'}$$

and

$$\mathrm{cov}(N_l, N_{l'}) = -n\, e^2\, q_l\, q_{l'}, \tag{9'}$$

where $q_l$ is now the probability of an allele being assigned to the $l$th subpopulation, $e$ is the expectation $g_t'(1)$, and $v$ the variance $g_t''(1) + g_t'(1) - [g_t'(1)]^2$ of the number of replicates produced by an original gene.

In practice, tribes and clusters do not split randomly, but are partially differentiated before fission, due to geographic variability. Villages also do not split randomly, but probably on familial lines. There will thus be a tendency for like alleles to be destined for the same subgroups, and we now consider an extension to the previous model that will incorporate this effect. Let us label the $n$ variant copies existing at the time of fission and write

$$X_{ij} = \begin{cases} 1 & \text{if the $j$th variant copy goes to the $i$th subpopulation,} \\ 0 & \text{otherwise.} \end{cases}$$

Then $m_i = \sum_{j=1}^{n} X_{ij}$ is the number of original genes in the $i$th subpopulation, and $\sum_i m_i = \sum_i \sum_j X_{ij} = n$, $E(X_{ij}) = q_i$ and $\mathrm{var}(X_{ij}) = q_i(1 - q_i)$. For a random assignment of genes to subpopulations, the correlation between $X_{ij}$ and $X_{ij'}$ is zero for $j \ne j'$.
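As a numerical illustration of (7'), (8') and (9'), the following Python sketch builds the mean vector and covariance matrix of the current replicate numbers under independent assortment. The function name `fission_moments` and the input values are hypothetical, chosen only for the example. Summing all entries of the covariance matrix should recover $nv$, the variance of the total number of replicates, because the assignment terms cancel.

```python
from typing import List, Tuple


def fission_moments(
    n: int, q: List[float], e: float, v: float
) -> Tuple[List[float], List[List[float]]]:
    """Means and covariances of current replicate numbers N_l after a
    single fission with independent assortment, per (7'), (8'), (9').

    n : number of variant copies at the time of fission
    q : assignment probabilities q_l (assumed to sum to 1)
    e : mean number of current replicates of one original gene
    v : variance of that number
    """
    r = len(q)
    means = [n * ql * e for ql in q]                       # (7')
    cov = [[0.0] * r for _ in range(r)]
    for a in range(r):
        for b in range(r):
            if a == b:
                # (8'): replication variance plus assignment variance
                cov[a][b] = n * v * q[a] + n * q[a] * (1 - q[a]) * e ** 2
            else:
                # (9'): negative covariance from multinomial assignment
                cov[a][b] = -n * e ** 2 * q[a] * q[b]
    return means, cov


# Hypothetical inputs, for illustration only.
n, q, e, v = 20, [0.5, 0.3, 0.2], 1.5, 4.0
means, cov = fission_moments(n, q, e, v)
total_mean = sum(means)                       # should equal n * e
total_var = sum(sum(row) for row in cov)      # should equal n * v
print(total_mean, total_var)
```

The check on the totals is a convenient sanity test when the formulae are coded for other values of $q_l$.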

For a nonrandom assignment, however, suppose that this correlation is now $\rho$. [We note that for such zero-one variables, $1 \ge \rho \ge \max\{-q_i/(1-q_i)\}$.] Then, $\mathrm{cov}(X_{ij}, X_{ij'}) = q_i(1-q_i)\rho$, and hence

$$P(X_{ij'} = 1, X_{ij} = 1) = E(X_{ij} X_{ij'}) = q_i^2 + \mathrm{cov}(X_{ij}, X_{ij'}) = q_i^2 + q_i(1-q_i)\rho. \tag{10}$$

Thus,

$$P(X_{ij'} = 1 \mid X_{ij} = 1) = q_i + (1-q_i)\rho,$$

while

$$P(X_{ij'} = 0 \mid X_{ij} = 1) = (1-q_i)(1-\rho).$$

Thus, by symmetry, we have

$$P(X_{i'j'} = 1 \mid X_{ij} = 1) = q_{i'}(1-\rho), \qquad E(X_{ij} X_{i'j'}) = q_i\, q_{i'}\, (1-\rho) \tag{11}$$

and $\mathrm{cov}(X_{ij}, X_{i'j'}) = -\rho\, q_i\, q_{i'}$.

If the correlation $\rho$ is positive, then the assignment of any allele to a particular subpopulation increases the probability of the same assignment for any like allele. The resulting distribution thus has higher variability among the $\{m_i; 1 \le i \le r\}$ than the no-association multinomial assignment. If $\rho$ is negative, the variability among the $\{m_i\}$ is expected to be smaller than under the multinomial model.

The precise joint distribution of the $\{m_i\}$ is not important for the present analysis, but it is of interest to demonstrate a distributional model that gives rise to the above symmetric associations of equation (11), expressed in terms of the single parameter $\rho$. Such a model is provided by the multivariate Polya urn, a model of “contagion” discussed by FELLER (1968, p. 110) and in more detail by JOHNSON and KOTZ (1977). This model provides for a situation in which knowledge that one individual is “affected” increases the probability that a subsequent one will also be. We have the situation in which knowledge that an individual carrying a certain allele has been assigned to a certain subdivision increases the probability that a subsequent such individual will be destined for the same subdivision, since individuals carrying like alleles are more likely to be closely related.

Suppose we have in an urn $s$ balls, of which $sq_i$ are of color $i$ ($1 \le i \le r$), one color associated with each of the $r$ subpopulations to be formed. Suppose also we have placed our $n$ variant replicates, which are to be assigned to these subpopulations, in some (random) order. A ball is drawn at random from the urn, and the first variant replicate is assigned to the subpopulation associated with the color of the ball drawn. Thus the variant goes to the $i$th subpopulation with probability $q_i$. The ball drawn and $c$ additional balls of the same color are added to the urn.
The process is repeated $(n-1)$ more times to assign each of the remaining $(n-1)$ variant copies. If $c = 0$, the probability that a gene goes to population $i$ is $q_i$ independently for each, and we have a model of independent assignments. For a positive $c$, we shall have in our urn extra balls of colors associated with subpopulations that have been the destination of previous genes. Thus, the next gene has an increased probability of going to such a subpopulation. At every stage, the current constitution of the urn gives the current destination probabilities, dependent upon the destination of previously assigned variant copies. The larger $c$, the larger the positive association between variant-copy destinations, but the overall probability for each single copy is $q_i$ for subpopulation $i$. With $\rho = c/(c+s)$, we have the covariance structure of equations (10) and (11). We note that if we have several different allelic types of variants, which under a branching process replicate and migrate independently, we may assume that each has its own urn, each with the same initial constitution.

Under this model of dispersion association between like alleles, the expectation $E(N_l)$ is unchanged, but from equations (10) and (11) we have

$$\mathrm{var}(m_l) = n\, q_l (1-q_l)\, [1 + (n-1)\rho] \tag{12}$$

and hence

$$\mathrm{var}(N_l) = \mathrm{var}[E(N_l \mid m_l)] + E[\mathrm{var}(N_l \mid m_l)] = e^2 n\, q_l (1-q_l)\, (1 + (n-1)\rho) + v\, n\, q_l \tag{13}$$

and

$$\mathrm{cov}(N_l, N_{l'}) = -e^2 n\, q_l\, q_{l'}\, [1 + (n-1)\rho]. \tag{14}$$
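The urn scheme can be simulated directly as a check on (12). The sketch below is a minimal Monte Carlo illustration and not part of the original analysis; the function names, the seed, and the urn composition ($s = 10$ balls in proportions 0.5, 0.3, 0.2, with $c = 2$) are assumptions made for the example.

```python
import random
from typing import List


def polya_assign(n: int, counts: List[int], c: int, rng: random.Random) -> List[int]:
    """Assign n variant copies to subpopulations by the multivariate Polya urn.

    counts : initial number of balls of each colour (s * q_i)
    c      : extra balls of the drawn colour added after each draw
    Returns the numbers m_i of copies sent to each subpopulation.
    """
    urn = list(counts)
    m = [0] * len(urn)
    for _ in range(n):
        # Draw a colour with probability proportional to its current count.
        i = rng.choices(range(len(urn)), weights=urn)[0]
        m[i] += 1
        urn[i] += c  # drawn ball is returned, plus c more of the same colour
    return m


def var_m_theory(n: int, q_l: float, rho: float) -> float:
    """Equation (12): var(m_l) = n q_l (1 - q_l) [1 + (n - 1) rho]."""
    return n * q_l * (1 - q_l) * (1 + (n - 1) * rho)


rng = random.Random(1979)          # fixed seed so the run is repeatable
n, counts, c = 10, [5, 3, 2], 2    # hypothetical urn: s = 10
s = sum(counts)
rho = c / (c + s)
trials = 20000
draws = [polya_assign(n, counts, c, rng)[0] for _ in range(trials)]
mean_m = sum(draws) / trials
var_m = sum((x - mean_m) ** 2 for x in draws) / (trials - 1)
theory = var_m_theory(n, counts[0] / s, rho)
print(f"simulated var(m_1) = {var_m:.3f}; equation (12) gives {theory:.3f}")
```

Setting `c = 0` in the same sketch gives independent assignments, for which the simulated variance should approach the binomial value $n q_l (1 - q_l)$.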

These three formulae will be required in the following section.

CLUSTER DISPERSION AND AGE ESTIMATION

Our aim is to estimate parameters of American Indian demographic history from current genetic data. Observable variability between subpopulations, with regard to frequencies of rare variant alleles, conveys information about the parameters of the models of fission and variant replication. Suppose now we have the single fission model of the previous section, and numbers $\{N_l; 1 \le l \le r\}$ are observed in current subpopulations. Since alleles disperse initially to subpopulations with varying probabilities $\{q_l; 1 \le l \le r\}$, it is $N_l/q_l$ whose expectation is constant over subpopulations [equation (7')]. Thus the appropriate measure of variability is

$$D = \sum_{l=1}^{r} \Big[\frac{N_l}{q_l} - \frac{1}{r}\sum_{l'=1}^{r} \frac{N_{l'}}{q_{l'}}\Big]^2 \Big/ (r-1) = \frac{1}{r}\sum_{l=1}^{r} \Big(\frac{N_l}{q_l}\Big)^2 - \sum_{l \ne l'} \frac{N_l N_{l'}}{q_l q_{l'}} \Big/ [r(r-1)], \tag{15}$$

and conditional on the number of variants in each subpopulation immediately subsequent to fission, we have

$$E(D \mid \{m_l; 1 \le l \le r\}) = (v/r) \sum_{l=1}^{r} (m_l/q_l^2) + e^2 D_0, \tag{16}$$

where $e$ and $v$ are the mean and variance of the number of current replicates of an original gene, as before, and $D_0$ is the initial dispersion, that is, (15) evaluated with $m_l$ in place of $N_l$. Also, using equation (12) we readily obtain

$$E(D_0) = n\, Q_1\, [1 + (n-1)\rho], \quad \text{where } Q_1 = (1/r) \sum_{l} (1/q_l), \tag{17}$$

and hence

$$E(D) = E[E(D \mid \{m_l; 1 \le l \le r\})] = n\, Q_1\, \{v + e^2 [1 + (n-1)\rho]\}. \tag{18}$$
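The dispersion statistic and its expectation are simple to compute. The following Python sketch evaluates $D$ in both forms given in (15), and the expectation (18); the function names and the counts used are hypothetical and serve only to show that the two forms of (15) agree.

```python
from typing import Sequence


def dispersion(N: Sequence[float], q: Sequence[float]) -> float:
    """First form of (15): sample variance of the scaled counts N_l / q_l."""
    r = len(N)
    x = [Nl / ql for Nl, ql in zip(N, q)]
    x_bar = sum(x) / r
    return sum((xl - x_bar) ** 2 for xl in x) / (r - 1)


def dispersion_expanded(N: Sequence[float], q: Sequence[float]) -> float:
    """Second form of (15): mean square minus mean cross-product."""
    r = len(N)
    x = [Nl / ql for Nl, ql in zip(N, q)]
    sum_sq = sum(xl ** 2 for xl in x)
    cross = sum(x) ** 2 - sum_sq          # sum over ordered pairs l != l'
    return sum_sq / r - cross / (r * (r - 1))


def expected_dispersion(n: int, q: Sequence[float], e: float, v: float, rho: float) -> float:
    """Equation (18): E(D) = n Q_1 {v + e^2 [1 + (n - 1) rho]}."""
    Q1 = sum(1.0 / ql for ql in q) / len(q)
    return n * Q1 * (v + e ** 2 * (1 + (n - 1) * rho))


# Hypothetical observed counts and assignment probabilities.
N_obs = [12, 5, 3]
q = [0.5, 0.3, 0.2]
D_a = dispersion(N_obs, q)
D_b = dispersion_expanded(N_obs, q)
print(D_a, D_b)
```

With a single variant copy split evenly between two subpopulations and no replication noise (`n=1`, `q=[0.5, 0.5]`, `e=1`, `v=0`), the dispersion is always 2, which is the value returned by `expected_dispersion`.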

The parameter $\rho$ is similar to the $\rho$ of ROTHMAN, SING and TEMPLETON (1974), in that it measures correlations in allelic types within a deme. However, it is obtained via a different model, and whereas their $\rho$ encompasses all observed allelic association, ours is a parameter of initial association due to the fission mechanism only, this being separated from the contribution due to the subsequent replication process.

In addition to observing variation between subpopulations, we may observe also the total number of replicates of the variant allele. In addition to total numbers $N = \sum_l N_l$, it is also convenient to consider the weighted sum $N^* = (1/r) \sum_l (N_l/q_l)$. These have expectation [from equation (7')]

$$E(N^*) = E(N) = ne, \tag{19}$$

while from (13) and (14) we obtain

$$\mathrm{var}(N^*) = (1/r)\, Q_1\, n v + e^2 n\, (1 + (n-1)\rho)\, \{Q_1/r - 1\}. \tag{20}$$

In the case $q_l = 1/r$ for each $l$ ($1 \le l \le r$), this variance (20) reduces to $nv$, the variance of $N$. We shall also require the variance of the dispersion $D$. This involves the fourth moments of $\{N_i; 1 \le i \le r\}$, but can be evaluated for the case $\rho = 0$, when $\{m_i; 1 \le i \le r\}$ have a multinomial distribution. Lengthy algebra provides

$$\mathrm{var}(D \mid \rho = 0) = n\, C_4\, Q_3 / r + C_2^2\, n\, [2(n-1)\, Q_2\, (r-2)/(r-1)^2 - Q_1^2\, \{1 - 2(n-1)/(r-1)^2\}], \tag{21}$$

where $Q_s = (1/r) \sum_l (1/q_l^s)$ and $C_j$ is the $j$th (absolute) moment of the number of current replicates of a single original gene. It is often convenient to assume a two-parameter geometric distribution for the number of offspring replicates (KEIDING and NIELSEN 1975). In that case, we have

$$C_2 = v + e^2 = e(2k - 1), \tag{22}$$

where $k$ is the expected number of replicates conditional upon nonextinction (THOMPSON 1976) and

$$C_4 = e(24k^3 - 36k^2 + 14k - 1).$$
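The moment formulae for $C_2$ and $C_4$ can be checked numerically. The sketch below assumes (this form is not spelled out in the text) that the number of current replicates of an original gene is zero with probability $1 - e/k$ and otherwise follows a geometric distribution on $1, 2, \ldots$ with mean $k$, which is the form the $t$th-generation distribution takes under a geometric offspring law. The function name and the values of $e$ and $k$ are hypothetical.

```python
def replicate_moment(e: float, k: float, power: int, terms: int = 5000) -> float:
    """j-th raw moment of the replicate number Z by direct summation.

    Assumed distribution (zero-modified geometric):
      P(Z = 0) = 1 - e/k
      P(Z = j) = (e/k) * (1/k) * (1 - 1/k)**(j - 1),  j = 1, 2, ...
    so that E(Z) = e and E(Z | Z > 0) = k.  Requires k > 1 and e <= k.
    """
    p = 1.0 / k
    prob = (e / k) * p          # P(Z = 1)
    total = 0.0
    for j in range(1, terms + 1):
        total += (j ** power) * prob
        prob *= 1.0 - p         # step to P(Z = j + 1)
    return total


e, k = 1.2, 3.0                 # hypothetical mean and conditional mean
C1 = replicate_moment(e, k, 1)
C2 = replicate_moment(e, k, 2)
C4 = replicate_moment(e, k, 4)
C2_formula = e * (2 * k - 1)
C4_formula = e * (24 * k ** 3 - 36 * k ** 2 + 14 * k - 1)
print(C1, C2, C2_formula, C4, C4_formula)
```

The truncation at 5000 terms is far more than needed for these values, since the geometric tail decays by a factor $1 - 1/k$ at each step.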

The variables $e$ and $k$ are of course functions only of the parameters of the replication process; i.e., of $t$ and the parameters of the offspring distribution. If we assume a geometric offspring distribution with parameters constant over time, the number of replicates provides an estimate of age if the mean rate of increase ($\lambda$) is known, or of $\lambda$ if the age is known. [On the basis of total numbers alone, the two parameters are confounded.] The dispersion of the variant also provides information, however, and the statistics $D$ and $N^*$ provide an estimate of any two combinations of the four basic parameters, $\lambda$, $n$, $t$ and $\rho$. Note again that since we now refer only to the history since fission, and not to the total history of a private variant allele, the statistics $D$ and $N^*$ may be considered for any allele of sufficiently small frequency, but only for a private variant arising from a single mutation in the not too distant past is it then legitimate to consider projecting the same further back in time to obtain, from the number of replicates ($n$) at time of fission ($t$), an order of magnitude for the total age.

If variant frequencies differ widely between subpopulations, this is an indication of rapid growth from small initial numbers. WARD and NEEL (1976) note the existence of a strong cline in the frequency of the Yanomama albumen variant Yan2, due to differentiation between tribal subgroups. Yan2 may thus be a younger variant than is indicated by total numbers alone.

In addition to the process parameters, the statistics $D$ and $N^*$ involve also the variables $N_l$ and $q_l$. The $\{q_l; 1 \le l \le r\}$ are historical parameters, and the $\{N_l; 1 \le l \le r\}$, although data observations, may not be accurately known. Provided that the number of variant replicates is small compared to total cluster size, however, we may view the expansion of total clusters as relatively deterministic, and $D$ and $N^*$ may be computed from allele frequencies. Suppose that there are a total of $M_l$ genes in subpopulation $l$ ($1 \le l \le r$) and $M = \sum_l M_l$; then $f_l = N_l/M_l$ is the variant allele frequency in this subpopulation, and substituting $q_l = M_l/M$ we have

$$D = M^2 \sum_l (f_l - \bar f)^2 / (r-1) \quad \text{and} \quad N^* = M \bar f, \quad \text{where } \bar f = (1/r) \sum_l f_l.$$
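The frequency form is what one would compute in practice. A minimal Python sketch follows, with hypothetical cluster sizes and variant counts; it evaluates $D$ and $N^*$ from allele frequencies and, for comparison, $D$ from the scaled counts $N_l/q_l$ with $q_l = M_l/M$.

```python
from typing import Sequence, Tuple


def dispersion_from_frequencies(M: float, f: Sequence[float]) -> Tuple[float, float]:
    """Return (D, N*) computed from the total gene number M and the
    subpopulation variant frequencies f_l, using q_l = M_l / M."""
    r = len(f)
    f_bar = sum(f) / r                      # unweighted mean frequency
    D = M ** 2 * sum((fl - f_bar) ** 2 for fl in f) / (r - 1)
    N_star = M * f_bar
    return D, N_star


# Hypothetical data: genes per subpopulation and variant copies observed.
sizes = [200, 120, 80]     # M_l
counts = [12, 6, 2]        # N_l
M = sum(sizes)
f = [Nl / Ml for Nl, Ml in zip(counts, sizes)]
D_freq, N_star = dispersion_from_frequencies(M, f)

# Same statistic from the scaled counts N_l / q_l.
q = [Ml / M for Ml in sizes]
x = [Nl / ql for Nl, ql in zip(counts, q)]
x_bar = sum(x) / len(x)
D_counts = sum((xl - x_bar) ** 2 for xl in x) / (len(x) - 1)
print(D_freq, D_counts, N_star)
```

The two values of $D$ coincide because $N_l/q_l = M f_l$ under this substitution.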

Thus, to estimate $D$ and $N^*$, estimates of only $M$ and $\{f_l; 1 \le l \le r\}$ are required. To estimate the total number of replicates, $N$, the relative total cluster sizes $\{M_l; 1 \le l \le r\}$ must also be known. Where feasible, this is preferable, since $N$ has a smaller variance than $N^*$.

APPLICATION TO HIERARCHIC CLUSTERS OF YANOMAMA INDIANS

The American Indian tribe whose genetic and demographic structure has been most extensively studied is the Yanomama (NEEL and WEISS 1975; SPIELMAN, MIGLIAZZA and NEEL 1974; TANIS et al. 1974). Of particular interest is the polymorphic albumen variant Yan2 (NEEL 1978). The number of replicates of this variant suggests that it may predate tribal differentiation (THOMPSON 1976), yet the fact that it is restricted to this tribe implies either a younger variant with a rapid rate of increase or extreme genetic isolation of the Yanomama over a very long period. It is thus of interest to consider what further information with regard to these alternatives is provided by the dispersion of the variant within the tribe. In addition, there are several other red cell polymorphisms whose tribal distributions have been studied, and in which one allele has a sufficiently low frequency for analysis via a branching process model (SMOUSE and NEEL 1977). These provide additional information for estimates of parameters of the tribal fission process, although we shall not, of course, wish to estimate their ages.

Although for a tribal population we can never have accurate demographic histories, for the Yanomama some historical information is available, and the current demographic picture has been extensively studied. Three tribal clusters (Ocamo, Wanaboweitari and Namoweitari) are believed to have dispersed from a single origin some four or five generations ago. Two other clusters (Padamo and Shamatari) fall within the same linguistic group, and probably have a similar time-depth of separation, although they are currently more isolated from the former three clusters than are the three among themselves (SPIELMAN, MIGLIAZZA and NEEL 1974; MIGLIAZZA, personal communication). For the former three clusters, most villages have been studied, and since villages within a cluster are the units from which, on tribal expansion and fission, new clusters will form, dispersion of variant replicates between villages provides an estimate of the lineal associations in the tribal fission process (i.e., of $\rho$).

The above five clusters make up the Yanomami linguistic group. In addition, there are three other groups: Ninam (or Yanam), Yanomam and Sanema (MIGLIAZZA, personal communication). Although these three clusters have been less extensively studied, estimates of allele frequencies and total population sizes are available. The time depth of the dispersion between the four linguistic groups ("superclusters") is unknown, and they may result from a tree structure rather than from a single fission. Nevertheless, the allelic dispersion between them provides a third level to the analysis. The allele frequencies in the following analysis are taken from the tabulation of SMOUSE and NEEL (1977), and population sizes have been provided by MIGLIAZZA (personal communication).

Differentiation between villages

From equation (17), the dispersion between villages within a cluster has the expectation $E(D_0) = n Q_1 (1 + (n-1)\rho)$. From equation (21) with $C_j = 1$ for each $j$ and with $\rho = 0$, we have

$$\mathrm{var}(D_0) = n Q_3 / r + n\, [2(n-1)\, Q_2\, (r-2)/(r-1)^2 - Q_1^2\, \{1 - 2(n-1)/(r-1)^2\}].$$
Thus, if relative village sizes, and hence $Q_1$, are known, and if $n$, the total number of variant replicates in the cluster, is also given, we may estimate $\rho$ by $\hat\rho = \{(D_0/nQ_1) - 1\}/(n-1)$ with a variance under the null hypothesis of no association of

$$\mathrm{var}(\hat\rho) = \mathrm{var}(D_0)/\{n(n-1)Q_1\}^2 = \{Q_3/(r Q_1^2) - 1\}/\{n(n-1)^2\} + 2\{Q_2 (r-2)/Q_1^2 + 1\}/\{n(n-1)(r-1)^2\}.$$
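The estimator of $\rho$ and its null variance can be sketched in a few lines of Python. The closed form coded below is the multinomial ($\rho = 0$) variance of $D_0$ divided by $\{n(n-1)Q_1\}^2$, written with $Q_s = (1/r)\sum_l (1/q_l^s)$; because the algebra is lengthy, the sketch also checks the closed form against a brute-force enumeration of every possible assignment for a small hypothetical case. All function names and input values are illustrative assumptions.

```python
from itertools import product
from typing import Sequence


def Q(q: Sequence[float], s: int) -> float:
    """Q_s = (1/r) * sum over l of 1 / q_l**s."""
    return sum(1.0 / ql ** s for ql in q) / len(q)


def dispersion0(m: Sequence[int], q: Sequence[float]) -> float:
    """Initial dispersion D_0: sample variance of m_l / q_l."""
    x = [ml / ql for ml, ql in zip(m, q)]
    x_bar = sum(x) / len(x)
    return sum((xl - x_bar) ** 2 for xl in x) / (len(x) - 1)


def rho_hat(D0: float, n: int, q: Sequence[float]) -> float:
    """Estimate of rho: {D_0 / (n Q_1) - 1} / (n - 1)."""
    return (D0 / (n * Q(q, 1)) - 1.0) / (n - 1)


def null_variance(n: int, q: Sequence[float]) -> float:
    """Variance of rho_hat under independent (multinomial) assignment."""
    r = len(q)
    Q1, Q2, Q3 = Q(q, 1), Q(q, 2), Q(q, 3)
    var_D0 = n * Q3 / r + n * (
        2 * (n - 1) * Q2 * (r - 2) / (r - 1) ** 2
        - Q1 ** 2 * (1 - 2 * (n - 1) / (r - 1) ** 2)
    )
    return var_D0 / (n * (n - 1) * Q1) ** 2


def null_variance_by_enumeration(n: int, q: Sequence[float]) -> float:
    """Same variance, by summing over all r**n assignments of n labelled copies."""
    r = len(q)
    first = second = 0.0
    for dest in product(range(r), repeat=n):
        prob = 1.0
        m = [0] * r
        for i in dest:
            prob *= q[i]
            m[i] += 1
        est = rho_hat(dispersion0(m, q), n, q)
        first += prob * est
        second += prob * est ** 2
    return second - first ** 2


q = [0.5, 0.3, 0.2]     # hypothetical relative village sizes
n = 4                   # hypothetical number of variant copies
closed_form = null_variance(n, q)
brute_force = null_variance_by_enumeration(n, q)
print(closed_form, brute_force)
```

Enumeration is feasible only for very small $n$ and $r$; it is used here purely as a check on the closed form.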

More generally, at some point in time subsequent to fission, $D/Q_1$ has the expectation $n\{v + e^2 (1 + (n-1)\rho)\}$ [see equation (18)], or $ne\{(2k-1) + e(n-1)\rho\}$ [see equation (22)], while $N$ and $N^*$, the unweighted and weighted estimated numbers of current replicates, both have expectation $ne$ [see equation (19)]. Thus $D/NQ_1$ provides an estimate of

$$(2k-1) + e(n-1)\rho, \tag{23}$$

which is an increasing function of time since fission. We shall refer to $D/NQ_1$ as the "normalized dispersion". Note that its expectation (23) is locus dependent, since it is a function of $n$, the number of variant copies. The dominant element, however, is $(2k-1)$, which depends only on the demographic history.

Table 1 gives the values of $D_0/nQ_1$, obtained for six different loci with a low-frequency allele, in each of the three village clusters. Corresponding estimates of $\rho$ and the estimated standard deviations under the hypothesis $\rho = 0$ are also given.

TABLE 1

Dispersion between villages

                                            Yan2      Rh-c      S         Hp2       Gc        PGM

Ocamo: total population size* = 451; number of villages ($r$)† = 6; number of adult genes‡ = 433; $Q_1 = (1/r)\sum_l (1/q_l)$ = 7.103
  Normalized dispersion ($D_0/nQ_1$)        1.745     1.097     3.214     2.524     1.992     2.269
  Estimates of $\rho$                       0.0473    0.0010    0.1079    0.0246    0.0256    0.0878
  Standard deviations of $\rho$-estimates   0.0437    0.0068    0.0334    0.3110    0.0175    0.0478

Wanaboweitari: total population size = 421; number of villages = 7; number of adult genes = 404; $Q_1$ = 7.279
  Normalized dispersion                     2.073     0.600     2.784     2.400     1.14.F    2.711
  Estimates of $\rho$                       0.0365    -0.0076   0.06M     0.0258    0.0089    0.10%
  Standard deviations of $\rho$-estimates   0.0198    0.0111    0.0215    0.0107    0.0357    0.0352

Namoweitari: total population size = 712; number of villages = 6; number of adult genes = 684; $Q_1$ = 8.268
  Normalized dispersion                     1.741     2.653     1.860     2.625     2.563     2.367
  Estimates of $\rho$                       0.0148    0.0269    0.0361    0.0161    0.0287    0.0398
  Standard deviations of $\rho$-estimates   0.0121    0.0097    0.0161    0.0082    0.0274    0.0168

* Total number of individuals in the villages used in this analysis, as given by MIGLIAZZA (personal communication).
† Number of villages used in this analysis.
‡ 48% of total (see NEEL and THOMPSON 1978).
§ Based on the allele frequencies of SMOUSE and NEEL (1977).

Although, individually, many of the estimates are not significantly nonzero, overall we have a clear picture of small but positive associations. The S locus exhibits low associations; this may be due to the fact that the S allele has a mean frequency of 0.15 in these groups, and cannot truly be treated as a rare allele since one individual in 50 will be homozygous SS. The other allele with a mean frequency of over 0.1 (Hp2) also shows lower estimates. However, the differences between estimates obtained from the different loci and different clusters are not significant. The linear combination of estimates with smallest variance is $\hat\rho = 0.0187$ with standard deviation 0.0031. Excluding the anomalous S allele, we have $\hat\rho = 0.0282$ with a standard deviation of 0.0041. We may thus conclude that there are significant associations in the dispersion of alleles between villages, due to genealogical relationships between the individuals within them. These initial associations will increase dispersion between tribal clusters, resulting in inflated estimates of times of divergence if this factor is not taken into account. Although the association is small, $\rho$ measures a per-variant-replicate association, and the overall effect may be large. In the remainder of this section we shall assume a $\rho$ value of 0.02, this being the most plausible estimate available at this time.
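The "linear combination of estimates with smallest variance" can be illustrated by inverse-variance weighting, which is the minimum-variance unbiased linear combination when the individual estimates are independent. That independence, and the function name `pooled_estimate`, are assumptions of this sketch; it uses only the Namoweitari row of Table 1 as input and is not claimed to reproduce the pooled values quoted in the text.

```python
from typing import Sequence, Tuple


def pooled_estimate(estimates: Sequence[float], sds: Sequence[float]) -> Tuple[float, float]:
    """Inverse-variance weighted mean and its standard deviation.

    Assumes the estimates are independent and unbiased for a common value.
    """
    weights = [1.0 / sd ** 2 for sd in sds]
    total = sum(weights)
    pooled = sum(w * x for w, x in zip(weights, estimates)) / total
    pooled_sd = total ** -0.5
    return pooled, pooled_sd


# Namoweitari estimates of rho and their standard deviations (Table 1).
rho_estimates = [0.0148, 0.0269, 0.0361, 0.0161, 0.0287, 0.0398]
rho_sds = [0.0121, 0.0097, 0.0161, 0.0082, 0.0274, 0.0168]
rho_pooled, rho_pooled_sd = pooled_estimate(rho_estimates, rho_sds)
print(f"pooled rho = {rho_pooled:.4f}, sd = {rho_pooled_sd:.4f}")
```

The pooled standard deviation is always smaller than the smallest individual one, which is why a set of individually non-significant estimates can be jointly significant.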

Differentiation between clusters

Table 2 gives the results of the analysis of dispersion between the three localized clusters known to have fissioned four or five generations ago, as well as the results for all five clusters of the Yanomami “supercluster”. In spite of considerable variability between loci, the data provide overall evidence of a recent split, but with some differentiation between groups. Between villages within a cluster, the normalized dispersions range from 0.6 to 3.2 (Table 1). Between clusters, they range from 2.82 to 20.99, with the exception of one value for each of the two alleles Rh-c and Hp2. Both of these show extraordinarily similar frequencies over the subset of three clusters, although not over all five. There is no significant difference in dispersion between the three clusters and the five, although the former are smaller in the majority of cases. For the Yan2 serum albumin variant, which is of particular interest here, the two values (5.46 and 4.18) are very similar.

We may analyze the dispersion values further on the basis of a geometric offspring distribution, assumed constant over generations, which is parametrized by the mean $\lambda$ and the parameter $h = c/(1-c)$, where $c$ is the geometric parameter. This has been found to be the most convenient parametrization for these analyses, and several approaches have given a consistent value of $h = 1.5$ for the Yanomama (THOMPSON 1976; THOMPSON and NEEL 1978). To avoid introducing a further parameter, we shall accept this value for $h$. We again note that, in accepting this branching process model for replication subsequent to fission, no assumption about the total age or origin of the allele is made. For this offspring distribution, we have

$$e = \lambda^t \quad \text{and} \quad (2k-1) = 1 + \frac{2h(\lambda^t - 1)}{\lambda - 1},$$

where $t$ is the number of generations since fission. Thus, from equation (B), we have, say,

$$H = \frac{2h(\lambda^t - 1)}{\lambda - 1} - p\lambda^t, \tag{24}$$

which may be estimated by

$$H_0 = D/NQ_1 - Np - 1,$$

where $D/NQ_1$ is the normalized dispersion discussed above, and $N$ is the current number of replicates. Since estimates of relative cluster sizes are available, we use the estimate $N$ of total numbers, which has smaller variance than the weighted estimate $N^*$.

Accepting the fixed values $t = 4$, $h = 1.5$, obtained from other studies, we consider first the case where $p = 0$. It is known that the population of these clusters has increased over the last 100 years, possibly as fast as 15% per generation (NEEL and WEISS 1975), and we therefore consider $\lambda$ values in the range $1 < \lambda \le 1.2$ (THOMPSON 1976). For all such $\lambda$, we can reject the hypothesis $p = 0$; estimating $p$ for each locus for a variety of $\lambda$ values, we obtain values
consistent with the previous estimate of $p \approx 0.02$. Conversely, accepting this value for $p$, we find values of $\lambda$ in the range of 1 to 1.2. However, there is insufficient time-depth to distinguish between hypothesized values in this range.

Dispersion between superclusters

Estimates of the normalized dispersion, $D/NQ_1$, between the four linguistic groups (“superclusters”) are given in Table 3, where it can be seen that the values range from 34.3 to 289.7. It is of interest that these values show no overlap with the dispersion values between clusters within a “supercluster” (Table 2) and also that they fall into two distinct groups. The S, Rh-c and Yan2 alleles give values in the range 200 to 300, and the other three in the range from 34 to 42. Although normalized dispersion has an expected value that is a function of the number of replicates at fission, and hence varies between loci, this dispersion dichotomy does not reflect a current dichotomy in number of replicates. It could reflect some selective force, since the three smaller values reflect an extreme similarity between linguistic groups. It could equally reflect a chance dissimilarity in growth, with the initial number of Yan2, S and Rh-c alleles being smaller, but predominating in the groups with a larger growth rate. Over the (unknown but probably substantial) time period involved, no single value in this range can be rejected as the result of such chance effects, but the way in which the values fall into two such distinct classes is suggestive of process heterogeneity.

Estimates of the time of fission similarly vary between loci, and for the estimation of this time, $t$, values of the dispersion should be pooled. Over long periods of time, the mean rate of increase cannot greatly exceed 1; the value $\lambda = 1$ gives a corresponding estimate of the time since supercluster divergence of 79.4 generations (2000 years). Considering only the five low-frequency polymorphisms, and not the private variant Yan2, the estimate becomes 68.5 generations. These estimates are higher than those obtained from linguistic studies, but are not outside the range of possibility for superclusters that may be considered almost distinct tribes (MIGLIAZZA, personal communication).
They are also upper bounds; higher values of $\lambda$ give smaller estimates of the time of divergence. For example, $\lambda = 1.02$ gives estimates of 48.3 and 43.6 generations for the two cases (including and excluding Yan2, respectively). Estimates for other values of $\lambda$ are given in Table 4.

We may compare the estimates here with those of SPIELMAN, MIGLIAZZA and NEEL (1974). On the basis of linguistic divergence, these authors estimated the time-depth of the separation of our four superclusters at between 600 and 1200 years (or between 24 and 60 generations). Our estimate is larger, but of the same order. Since the retention rate of cognates is unknown for small tribal populations, we may say that the genetic and linguistic evidence are in substantial agreement. It is also impossible to know to what extent contact subsequent to fission has influenced the estimates. Although migration will decrease estimates based on either genetic or linguistic data, the effect may be greater in the linguistic case, since there contact even without migration will influence divergence. We therefore suggest that their estimates are lower bounds, whereas our largest estimate (that with $\lambda = 1$) may not be an overestimate.

TABLE 3
Dispersion between superclusters

                                        Yan2        S     Rh-c      Hp2      Gca     PGM1
Estimated number of replicates (N)      1062     2453     1056     2454     1783      632
Weighted estimate (N*)                  1573     2590     1653     2557     2349      814
Normalized dispersion (D/NQ_1)        200.50   289.67   227.74    42.02    35.33    34.34

TABLE 4
Estimates of times of dispersion of superclusters, and corresponding estimates of the parameters of the history of the Yan2 variant

                 Dispersion time of superclusters     Corresponding estimates for Yan2
                 (generations)
Tribal growth    Including    Excluding               Growth    Number of replicates    Total age
rate (λ)         Yan2         Yan2                    rate      at time of fission      (generations)
1.000            79.4         68.5                    1.018     313                     190
1.005            67.6         59.1                    1.025     247                     148
1.010            59.1         52.4                    1.032     204                     125
1.015            53.0         47.5                    1.039     173                     108
1.020            48.3         43.6                    1.046     149                      96
1.025            44.5         40.4                    1.053     132                      86
1.030            41.4         37.8                    1.060     117                      78
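As a rough numerical check on the relations used in this section, the following Python sketch evaluates $H$ from equation (24), an estimate of $H$ from a normalized dispersion, and the divergence time implied by a given tribal growth rate. Several things here are assumptions of the sketch rather than part of the original analysis: the function names, the handling of the $\lambda = 1$ limit, the estimator being taken in the form $H_0 = D/NQ_1 - Np - 1$, and the neglect of the small $p\lambda^t$ term when inverting for $t$. It is intended only as an illustration of the arithmetic, not as a reconstruction of the pooling procedure behind Table 4.

```python
import math


def expected_h(lam, t, h=1.5, p=0.0):
    """H = 2h(lam^t - 1)/(lam - 1) - p*lam^t, as in equation (24).

    For lam = 1 the ratio (lam^t - 1)/(lam - 1) is replaced by its limit t.
    """
    if abs(lam - 1.0) < 1e-12:
        growth_sum = float(t)
    else:
        growth_sum = (lam ** t - 1.0) / (lam - 1.0)
    return 2.0 * h * growth_sum - p * lam ** t


def estimated_h(norm_dispersion, n, p):
    """Assumed estimator H_0 = D/(N Q_1) - N p - 1."""
    return norm_dispersion - n * p - 1.0


def divergence_time(lam, t_at_unit_growth):
    """Time t for growth rate lam that gives the same H as lam = 1 does
    at time t_at_unit_growth, ignoring the p*lam^t term (an assumption).

    Solves (lam^t - 1)/(lam - 1) = t_at_unit_growth for t; h cancels.
    """
    if abs(lam - 1.0) < 1e-12:
        return float(t_at_unit_growth)
    return math.log(1.0 + t_at_unit_growth * (lam - 1.0)) / math.log(lam)


if __name__ == "__main__":
    # Range of H for the cluster analysis (t = 4, h = 1.5, p = 0).
    for lam in (1.0, 1.1, 1.2):
        print(f"lambda={lam:.2f}  H={expected_h(lam, 4):.3f}")

    # Yan2 between superclusters, Table 3 values, with p taken as 0.02.
    print(f"H_0 for Yan2: {estimated_h(200.50, 1062, 0.02):.2f}")

    # Divergence times starting from the lambda = 1 estimate of 68.5.
    for lam in (1.000, 1.005, 1.010, 1.015, 1.020, 1.025, 1.030):
        print(f"lambda={lam:.3f}  t={divergence_time(lam, 68.5):.1f}")
```

Starting from the $\lambda = 1$ estimate of 68.5 generations, this inversion agrees with the "Excluding Yan2" column of Table 4 to within about 0.1 generation. Starting from 79.4, it matches the "Including Yan2" column only approximately, so the pooled computation evidently involves details not captured here. With $p = 0.02$ and the Table 3 values for Yan2, the assumed estimator gives about 178.3, close to the value of 178.06 quoted for Yan2 in the text.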

The age of Yan2

The higher-than-average dispersion between clusters for Yan2 indicates that this allele may indeed have, by chance, undergone higher-than-average rates of increase. The value of $H$ for Yan2 is 178.06, and, accepting now the estimates of $t$ obtained from the other loci, we may reverse our previous computations and estimate $\lambda$-values for Yan2, and hence also the number of Yan2 replicates at the time of fission. These also are given in Table 4. Assuming that the same $\lambda$-values for Yan2 obtained before tribal fission (an assumption for which there can never be evidence either for or against), we may estimate the total age of the Yan2 variant, as in THOMPSON (1976). These estimates provide the final column of Table 4.

The bounds of error on the estimates are wide, but the following points are significant: (1) Yan2 does have a higher-than-average dispersion, indicating a higher-than-average growth. (2) There is not necessarily any implication of selection in this observation, the values being well within the bounds of chance demographic variation. (3) The dispersions of a set of alleles do provide an estimate of an event in demographic history, which can in turn be used to estimate growth rate.

We note that the final estimates of age are in agreement with a value hypothesized by THOMPSON (1976). There it was commented that a value $\lambda = 1.02$ would give an estimate of 168 generations for Yan2. (The slight discrepancy between this value and the 190 generations of Table 4 is accounted for by the smaller estimate of 875 total adult-variant-replicates used at that time.) However, no justification for such a $\lambda$-value was given, beyond the fact that it could be maintained over long periods of time without demographic absurdity. The confirmation from this analysis that a $\lambda$-value around 1.02 may be appropriate for Yan2, even making the conservative assumption of a total tribal $\lambda$ of 1.0, is thus of considerable interest.

More important, we see that the present approach may provide information on the different effects of demographic variability on different alleles. Previously, values of $\lambda$ greater than 1.02 were rejected as implausible. We see now, however, that a tribal $\lambda$ of 1.02 allows a $\lambda$-value for Yan2 of nearly 1.05, with a corresponding estimate of age of less than 100 generations. This variation between alleles may be due to chance demographic factors, without implication of selection, and the current approach makes possible the detection of such factors. Even 100 generations is a very substantial period of time for a variant to have been limited to a single tribe. These results emphasize again the extreme genetic isolation of the Yanomama discussed by WARD et al. (1975).

This research was done while visiting the Department of Human Genetics, University of Michigan, supported by National Science Foundation grant BMS-74-11823, and completed while visiting the Department of Medical Biophysics and Computing, University of Utah. I am grateful to E. MIGLIAZZA, J. V. NEEL, P. E. SMOUSE and R. SPIELMAN for many helpful discussions.

LITERATURE CITED

FELLER, W., 1968 An Introduction to Probability and its Applications. Vol. 1 (2nd ed.). Wiley, New York.
FIX, A., 1975 Fission-fusion and lineal effect; aspects of the population structure of the Semai Senoi of Malaysia. Am. J. Phys. Anthropol. 43: 295-302.
JOHNSON, N. L. and S. KOTZ, 1977 Urn Models and Their Applications. Wiley, New York.
NEEL, J. V., 1967 The genetic structure of primitive human populations. Japan. J. Human Genet. 12: 1-16.
NEEL, J. V., 1978 Rare variants, private polymorphisms, and locus heterozygosity in Amerindian populations. Am. J. Human Genet. 30: 465-490.
NEEL, J. V. and E. A. THOMPSON, 1978 The number of private polymorphisms in a tribal population. Proc. Natl. Acad. Sci. U.S. 75: 1904-1908.
NEEL, J. V. and K. WEISS, 1975 The genetic structure of a tribal population, the Yanomama Indians. XII. Biodemographic studies. Am. J. Phys. Anthropol. 42: 25-52.
ROTHMAN, E. D., C. F. SING and A. R. TEMPLETON, 1974 A model for analysis of population structure. Genetics 78: 943-960.
SMOUSE, P. E. and J. V. NEEL, 1977 Multivariate analysis of gametic disequilibrium in the Yanomama. Genetics 85: 733-752.
SPIELMAN, R. S., E. C. MIGLIAZZA and J. V. NEEL, 1974 Regional linguistic and genetic differences among Yanomama Indians. Science 184: 637-644.
TANIS, R., R. E. FERRELL, J. V. NEEL and A. MORROW, 1974 Albumin Yanomama-2, a “private” polymorphism of serum albumin. Ann. Human Genet. 37: 327-332.
THOMPSON, E. A., 1976 Estimation of age and rate of increase of rare variants. Am. J. Human Genet. 28: 442-452.
THOMPSON, E. A., 1977 Estimation of the characteristics of rare variants. Proceedings of the Ove Frydenberg Memorial Symposium, Aarhus, 1976. Springer-Verlag, New York.
THOMPSON, E. A. and J. V. NEEL, 1978 The probability of founder effect in a tribal population. Proc. Natl. Acad. Sci. U.S. 75: 1442-1445.
WARD, R. H., H. GERSCHOWITZ, M. LAYRISSE and J. V. NEEL, 1975 The genetic structure of a tribal population, the Yanomama Indians. XI. Gene frequencies for 10 blood groups, and the ABH-Le Secretor traits in the Yanomama and their neighbors; the uniqueness of the tribe. Am. J. Human Genet. 27: 1-30.
WARD, R. H. and J. V. NEEL, 1976 The genetic structure of a tribal population, the Yanomama Indians. XIV. Clines and their interpretation. Genetics 82: 103-121.

Corresponding editor: W. J. EWENS
