J Mol Evol (1992) 35:196-204

Journal of Molecular Evolution © Springer-Verlag New York Inc. 1992

Patterns of Nucleotide Substitutions Inferred from the Phylogenies of the Class I Major Histocompatibility Complex Genes

Tadashi Imanishi* and Takashi Gojobori

DNA Research Center, National Institute of Genetics, Mishima, Shizuoka 411, Japan

Summary. Patterns of nucleotide substitutions in human major histocompatibility complex (MHC) class I genes were estimated by using phylogenetic trees of DNA sequences. The pattern is defined as a set of 12 parameters, each of which represents the relative frequency of substitutions from a particular nucleotide to another. The pattern at the antigen recognition sites (ARS) in functional MHC genes was remarkably different from that at the remaining coding region (non-ARS). In particular, the proportion of transitions among all the nucleotide substitutions (Ps) was extremely low at the third codon positions of ARS. In the HLA-A genes, Ps at the third codon positions was only 6% in ARS, whereas it was 69% in non-ARS. In HLA-B, the corresponding values were 30% in ARS and 80% in non-ARS. On the other hand, Ps in a class I pseudogene (HLA-H) was 57%, which was in good agreement with Ps in other pseudogenes. Because pseudogenes are selectively neutral, the pattern in pseudogenes is regarded as the pattern of spontaneous substitution mutations. In general, the pattern in functional genes that are subject to selective forces deviates from the pattern in pseudogenes. At the third codon positions in coding regions, transitions scarcely cause amino acid replacements, whereas about half of transversions do cause replacements. Accordingly, Ps at the third codon positions decreases if amino acid replacements are accelerated by natural selection but increases if amino acids are conserved by functional constraint. Our observations imply that the ARS region is subject to natural selection favoring amino acid replacements, whereas the non-ARS region is subject to functional constraint.

Key words: Major histocompatibility complex (MHC) -- HLA -- Antigen recognition site (ARS) -- Patterns of nucleotide substitutions -- Natural selection -- Functional constraint -- Pseudogene -- Neutrality

* Present address: Department of Anthropology, Faculty of Science, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113, Japan

Offprint requests to: T. Gojobori

Introduction

Because DNA is composed of four kinds of nucleotides, there are 12 different types of substitutions, in each of which one nucleotide is replaced with another. These substitutions do not occur with equal frequency in the evolutionary process of DNA. One way to evaluate the inequality is to calculate relative substitution frequencies (Gojobori et al. 1982; Li et al. 1984). The pattern of nucleotide substitutions is defined as the set of relative substitution frequencies. Knowledge of substitution patterns is important for understanding the mode of molecular evolution. The pattern in pseudogenes is regarded as the pattern of spontaneous substitution mutations, because pseudogenes are free from any selective forces (Gojobori et al. 1982; Kimura 1983). On the other hand, the pattern in functional genes is generally different from that in pseudogenes. Furthermore, the patterns in functional genes vary among genes and among codon positions because of differences in the type of natural selection or functional constraint. Therefore, the type of selective forces on genes can be specified by examining the substitution patterns.
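As a small illustration of the 12 substitution types, the following Python sketch enumerates them and labels each as a transition (purine to purine, or pyrimidine to pyrimidine) or a transversion. The names `classify` and `SUBSTITUTION_TYPES` are introduced only for this example.

```python
from itertools import permutations

PURINES = {"A", "G"}  # T and C are the pyrimidines


def classify(i, j):
    """Transition if both bases are purines or both are pyrimidines, else transversion."""
    return "transition" if (i in PURINES) == (j in PURINES) else "transversion"


# All ordered pairs (i, j) with i != j: the 12 substitution types.
SUBSTITUTION_TYPES = {(i, j): classify(i, j) for i, j in permutations("ATCG", 2)}

for (i, j), kind in SUBSTITUTION_TYPES.items():
    print(f"{i} -> {j}: {kind}")
```

Four of the 12 types are transitions and eight are transversions, so a proportion of transitions well above one-third already indicates a bias toward transitions.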

The class I major histocompatibility complex (MHC) molecules are glycoproteins that exist on the membrane of most nucleated cells. These molecules bind with processed foreign peptides and present them to the cytotoxic T-lymphocytes, which triggers the immune response against foreign peptides (Klein 1986). Human MHC molecules are called human leukocyte antigens (HLA). Class I HLA genes comprise a multigene family in which 17 distinct loci exist (Koller et al. 1987). Among them, HLA-A and HLA-B are highly polymorphic and apparently functional genes. Recently, a new class I gene HLA-H (previously called HLA-AR) was discovered (Zemmour et al. 1990). HLA-H is assumed to be a pseudogene because it lacks a cysteine residue that is necessary to form a disulfide bond and to stabilize the protein structure. Furthermore, most HLA-H alleles have a frameshift mutation that raises a termination codon in the α3 domain. It is known that HLA-H is tightly linked to HLA-A with physical distance being within 200 kb (Pontarotti et al. 1988).

Bjorkman et al. (1987a,b) determined the three-dimensional structure of a human class I MHC molecule, HLA-A2. They also determined the residues of antigen recognition sites (ARS). ARS is composed of 57 amino acid residues that are responsible for binding to peptides and recognition by T-cell receptors. Hughes and Nei (1988) observed that the numbers of nonsynonymous substitutions are larger than those of synonymous substitutions in ARS, but not in non-ARS of class I MHC genes. From this observation, they suggested that amino acid replacements in ARS are accelerated by overdominant selection (heterozygote advantage). However, the overdominance hypothesis is still controversial, and there are alternative hypotheses (for example, Bodmer 1972; Hill et al. 1991).

Strictly speaking, the comparison between ARS and non-ARS is not enough to judge whether natural selection is operating or not. It is necessary to make a comparison with the pattern in pseudogenes that are genuinely neutral. Thus, we examined the patterns of nucleotide substitutions in MHC pseudogenes as well as ARS and non-ARS of functional genes. Because many MHC alleles have already been sequenced, we collected DNA sequence data and estimated the substitution patterns using phylogenetic trees. Our results support Hughes and Nei's assertion that positive selection is operating on ARS, although we could not identify it as overdominant selection.

Methods

DNA sequences of HLA-A, -B, and -H genes were collected from the literature and DNA data bases (DDBJ, EMBL, and GenBank). HLA-C and nonclassical HLA genes were not used in the present study because the numbers of available allelic sequences were relatively small. We used all coding regions of functional genes and the corresponding regions of pseudogenes. The total lengths of sequences were 1098 bp in HLA-A, 1095 bp in HLA-B, and 1097 bp in HLA-H. Sequences shorter than 1000 bases long were discarded. In the end, 19 alleles of HLA-A, 26 alleles of HLA-B, and 5 alleles of HLA-H were suitable for the present analysis. Most alleles were named according to the standard nomenclature method (The WHO Nomenclature Committee for factors of the HLA system 1990), whereas others were given the same names as those used in the original papers. These sequences could be aligned easily because they needed only a few gaps to be inserted.

To construct the phylogenetic trees, the number of nucleotide substitutions per site was calculated for each pair of alleles assuming the two-parameter model (Kimura 1980). Then, phylogenetic trees were constructed by the neighbor-joining (NJ) method (Saitou and Nei 1987). We constructed two phylogenetic trees: one contains all alleles of HLA-A and HLA-B (Fig. 1) and the other contains all HLA-H alleles (Fig. 2).

We inferred the nucleotide sequences of ancestral MHC alleles from the phylogenetic trees by using the maximum parsimony method. By using Fitch's algorithm (Fitch 1971), nucleotides at every node (the branching point) were determined in order to minimize the number of substitutions in the whole tree. There were cases where different combinations of nucleotides at some nodes turned out to be equally parsimonious. In such cases, all kinds of nucleotides that realize the least number of substitutions were assigned to the nodes. Then, we counted the number of each kind of nucleotide substitution for each gene region. We considered exclusively the substitutions that occurred after the most recent common ancestors (represented by solid circles in Figs. 1 and 2). If the nucleotide at a node had not been determined uniquely, we ignored the substitutions that must have occurred on the branches connected with the node.

Let us designate the number of substitutions from nucleotide i to nucleotide j by n_ij and the average number of nucleotides i among all sequences by N(i). The proportion of substitutions from nucleotide i to nucleotide j is expressed by

$$P_{ij} = \frac{n_{ij}}{N(i)} \tag{1}$$

Then, the relative frequencies (%) of 12 kinds of nucleotide substitutions are calculated by

$$f_{ij} = \frac{P_{ij}}{\sum_{i} \sum_{j \neq i} P_{ij}} \times 100 \quad (i, j = A, T, C, \text{or } G;\ i \neq j) \tag{2}$$

f_ij is called the relative substitution frequency (Gojobori et al. 1982). It is regarded as the expected number of substitutions from nucleotide i to nucleotide j among 100 substitutions in a hypothetical random sequence that is composed of equal numbers of A, T, C, and G. The proportion of transitional substitutions was calculated by

$$P_s = f_{AG} + f_{TC} + f_{CT} + f_{GA} \tag{3}$$

Furthermore, the expected base contents at the equilibrium state, π̂(i), were calculated using the values of f_ij (Wright 1969; Tajima and Nei 1982).

Results

Phylogenetic Trees of MHC Genes

The phylogenetic tree of functional MHC genes has two major clusters corresponding to HLA-A and HLA-B (Fig. 1). Nucleotide differences between HLA-A and HLA-B were much larger than those between different alleles on the same locus. Thus, it confirms that interlocus variations between HLA-A and HLA-B exceeded intralocus variations, as pointed out by Lawlor et al. (1988). It seems that the degree of sequence diversity among HLA-A alleles is not so different from that among HLA-B alleles, though the number of alleles is larger in HLA-B than in HLA-A. Among HLA-A alleles, A*2401 was the most diverse allele. Alleles A*0101, A*1101, A11E, and A*0301 made a definite cluster. Alleles A*0201, A*0203, A*0205, A*0206, A*0210, A*6801, A*6802, and A*6901 made another large cluster. HLA-B alleles were largely divided into two groups. One of the two groups contained B*0702, B*0801, B*1401, B*1402, B*3801, B*3901, and B*4201. This group roughly corresponds to Bw4 specificity.

In constructing the phylogenetic tree of HLA-H, an allele of HLA-A (HLA-A*0101) was used as an outgroup (Fig. 2). This is because HLA-H shares higher homology in DNA sequences with HLA-A than any other class I genes. The tree clearly indicates that HLA-H alleles have arisen from a common ancestral gene that must have been a pseudogene. The degree of diversity among HLA-H alleles was not so high as that among functional MHC alleles. The number of nucleotide substitutions that occurred between the common ancestor and the extant alleles was, on the average, as much as 0.01 per nucleotide site. The number is about one-third of that in functional MHC genes, which implies recent divergence of HLA-H alleles.

[Figure 1. Tree with two clusters labeled HLA-A and HLA-B; horizontal axis: number of nucleotide substitutions per site.]

Fig. 1. A phylogenetic tree of functional class I MHC genes. The tree was constructed by the NJ method based on the numbers of nucleotide substitutions. The root of the tree was arbitrarily determined at the midpoint of the branch between the clusters of HLA-A and HLA-B. The most recent common ancestors for HLA-A and HLA-B are represented by solid circles. The following is the list of alleles analyzed. The accession numbers of GenBank are also shown in brackets if they are available.
HLA-A: A*0101 [M24043 (Parham et al. 1989)]; A*0201 [M32322 (Ennis et al. 1990)]; A*0203 (Holmes et al. 1987); A*0205 (Holmes et al. 1987); A*0206 [M24042 (Parham et al. 1989)]; A*0210 (Epstein et al. 1989); A*0301 (Strachan et al. 1984); A*1101 [M1600716010 (Cowan et al. 1987)]; A11E [X13111 (Mayer et al. 1988)]; A*2401 (N'Guyen et al. 1985); A*2501 [M32321 (Ennis et al. 1990)]; A*2601 [M24095 (Cianetti et al. 1989)]; A*2901 (Trapani et al. 1989); A*3101 (Kato et al. 1989b); A32 (Parham et al. 1989); A*3301 (Kato et al. 1989b); A*6801 (Holmes and Parham 1985); A*6802 (Holmes et al. 1987); A*6901 (Holmes and Parham 1985).
HLA-B: B*0702 [M32317 (Ennis et al. 1990)]; B*0801 [M24036 (Parham et al. 1989)]; B*1301 (Kato et al. 1989a); B*1302 [M19757 (Zemmour et al. 1988)]; B*1401 [M24040 (Parham et al. 1989)]; B*1402 [M24032 (Parham et al. 1989)]; B*1501 (Pohla et al. 1989); B*1801 [M24039 (Parham et al. 1989)]; B*2702 [X03664 (Seemann et al. 1986)]; B*2703 (Choo et al. 1988); B*2705 [X03665 (Seemann et al. 1986)]; B*3501 (Ooba et al. 1989); B*3701 [M32320 (Ennis et al. 1990)]; B*3801 (Müller et al. 1989); B*3901 (Müller et al. 1989); B*4101 [M24035 (Parham et al. 1989)]; B*4201 [M24034 (Parham et al. 1989)]; B*4401 (Kottmann et al. 1986); B*4402 [M24038 (Parham et al. 1989)]; B*4601 [M24033 (Parham et al. 1989)]; B*4701 [M19756 (Zemmour et al. 1988)]; B*4901 [M24037 (Parham et al. 1989)]; B*5101 [M21035 (Hayashi et al. 1989)]; B*5201 [M21036 (Hayashi et al. 1989)]; B*5701 [M32318 (Ennis et al. 1990)]; B*5801 (Ways et al. 1985).

[Figure 2. Tree of AR1, AR3, AR4, and HLA12.4 with HLA-A*0101 as outgroup; horizontal axis: number of nucleotide substitutions per site.]

Fig. 2. A phylogenetic tree of HLA-H alleles. The tree was constructed by the NJ method based on the numbers of nucleotide substitutions. An HLA-A allele (HLA-A*0101) was used as an outgroup. The root of the tree was arbitrarily determined at a point on the branch connected with the HLA-A allele. The most recent common ancestor of HLA-H is represented by a solid circle. The sources of HLA-H sequences are as follows: AR1 [M32104 (Zemmour et al. 1990)]; AR2 [M32105 (Zemmour et al. 1990)]; AR3 [M32106 (Zemmour et al. 1990)]; AR4 [M32107 (Zemmour et al. 1990)]; HLA12.4 (Malissen et al. 1982).
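For reference, the two-parameter distance (Kimura 1980) on which the trees in Figs. 1 and 2 are based can be sketched in Python as follows. This is a minimal illustration and not the program used in the study: the function name `k2p_distance` and the handling of gaps (any aligned site containing a character other than A, C, G, or T is skipped) are assumptions made for the example. With P and Q the observed proportions of transitional and transversional differences, the distance is d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q).

```python
import math

PURINES = {"A", "G"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned sequences.

    Sites where either sequence has a gap or ambiguous base are skipped
    (an assumption of this sketch).
    """
    sites = transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        sites += 1
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites      # proportion of transitional differences
    q = transversions / sites    # proportion of transversional differences
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)
```

For closely related alleles the correction is small: one transitional difference in 100 sites gives a distance only slightly above 0.01, which is the order of magnitude of the branch lengths discussed above.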

The Substitution Pattern in HLA-H

The values of n_ij and f_ij in HLA-H were estimated based on the phylogenetic tree of Fig. 2 (Table 1). The patterns in mammalian nuclear pseudogenes (Gojobori et al. 1982), mammalian nuclear functional genes [only the first and second codon positions are included (Gojobori et al. 1982)], and η-globin spacer region [η-globin pseudogene and its adjacent noncoding region (Goodman et al. 1989)] are also presented to make comparisons. These patterns were recalculated using the numbers of substitutions (n_ij) reported in original papers. In HLA-H, we could use 35 substitutions to calculate the values of f_ij. In other words, the sum of n_ij was 35. On the other hand, at least 43 substitutions were required to account for the nucleotide differences by the maximum parsimony method. This number may be called the minimum number of required substitutions (MNRS). Note that the sum of n_ij is always less than the value of MNRS because not all nucleotides at the nodes were determined uniquely in our analysis. The proportion of the sum of n_ij to MNRS was 81% in HLA-H. Thus, the majority of nucleotide substitutions could be used to calculate the substitution patterns in HLA-H.

The pattern in HLA-H was very close to those in other pseudogenes and η-globin spacer (Table 1). These patterns bore many characteristics in common. For example, the transitions C → T, G → A, and A → G were very frequent in all of these patterns. The proportion of transitions, Ps, was 57.3% in HLA-H, which was in good agreement with Ps in other pseudogenes. However, there was a slight difference. For example, the transversion G → C was frequent in HLA-H and functional genes but not in other pseudogenes. The transition T → C was frequent in the η-globin spacer region but not in the other genes. The equilibrium contents of nucleotides A and T, π̂(A) and π̂(T), summed up to 71.2% in HLA-H (Table 2). In contrast, observed contents of A and T in HLA-H were only 36.5%. Thus, the base contents in HLA-H are not at the equilibrium state, and the HLA-H gene is expected to increase its contents of A and T in future.

Table 1. Relative substitution frequencies, f_ij, in HLA-H and other genes; the numbers of substitutions, n_ij, in HLA-H are also presented

                HLA-H              Pseudogenes(a)   Functional genes(a)   η-globin spacer(b)
Substitution    n_ij      f_ij     f_ij             f_ij                  f_ij
A → T           0         0.0      4.9              4.3                   3.1
A → C           2         7.9      5.3              6.4                   3.5
A → G           3         11.9     11.0             11.3                  14.7
T → A           1         4.9      4.3              5.3                   2.3
T → C           1         4.9      6.5              4.2                   15.2
T → G           0         0.0      4.7              2.1                   2.8
C → A           2         5.3      8.2              8.1                   4.9
C → T           9         23.7     21.9             8.9                   18.9
C → G           3         7.9      4.6              13.0                  4.9
G → A           7         16.8     16.0             20.6                  21.0
G → T           2         4.8      6.9              4.8                   4.7
G → C           5         12.0     5.5              11.0                  4.4
Ps              (20/35)   57.3     55.5             45.1                  69.6

(a) Data from Gojobori et al. (1982)
(b) Data from Goodman et al. (1989)

Table 2. Base contents in HLA-H and other genes(a)

        HLA-H                    Pseudogenes   Functional genes   η-globin spacer
        Observed    Expected     Expected      Expected           Expected
        π(i)        π̂(i)         π̂(i)          π̂(i)               π̂(i)
A       20.1        26.3         28.1          31.4               28.1
T       16.4        44.9         38.0          32.6               29.6
C       30.2        15.8         14.5          17.9               22.2
G       33.2        13.0         19.4          18.0               20.1

(a) π(i) is the observed value; π̂(i) is the expected value at the equilibrium state calculated from the values of f_ij in Table 1. All values are percentages.
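The HLA-H figures above can be approximately recomputed from Eqs. (1)-(3). The Python sketch below does this under two assumptions that are ours and not stated in the text. First, the observed base contents in percent (Table 2) are used in place of the counts N(i); this leaves f_ij unchanged because Eq. (2) renormalizes. Second, the equilibrium base contents are taken to be the stationary distribution of a Markov chain whose substitution rates are proportional to f_ij, which is our reading of the cited method (Wright 1969; Tajima and Nei 1982). All function and variable names are hypothetical.

```python
BASES = "ATCG"
TRANSITIONS = {("A", "G"), ("G", "A"), ("T", "C"), ("C", "T")}

# n_ij for HLA-H (Table 1) and observed base contents in percent (Table 2).
# Using percentages in place of the counts N(i) is an assumption of this sketch.
HLA_H_COUNTS = {
    ("A", "T"): 0, ("A", "C"): 2, ("A", "G"): 3,
    ("T", "A"): 1, ("T", "C"): 1, ("T", "G"): 0,
    ("C", "A"): 2, ("C", "T"): 9, ("C", "G"): 3,
    ("G", "A"): 7, ("G", "T"): 2, ("G", "C"): 5,
}
HLA_H_BASE_CONTENTS = {"A": 20.1, "T": 16.4, "C": 30.2, "G": 33.2}


def relative_frequencies(counts, base_contents):
    """Eqs. (1) and (2): P_ij = n_ij / N(i), then rescale so the f_ij sum to 100."""
    p = {(i, j): n / base_contents[i] for (i, j), n in counts.items()}
    total = sum(p.values())
    return {pair: 100.0 * value / total for pair, value in p.items()}


def transition_proportion(f):
    """Eq. (3): Ps is the sum of the four transitional f_ij (in percent)."""
    return sum(value for pair, value in f.items() if pair in TRANSITIONS)


def equilibrium_contents(f, steps=5000):
    """Stationary base contents of a chain with rates proportional to f_ij (assumed model)."""
    out = {i: sum(f.get((i, j), 0.0) for j in BASES if j != i) for i in BASES}
    dt = 1.0 / (2.0 * max(out.values()))  # small step keeps every content positive
    pi = {b: 0.25 for b in BASES}
    for _ in range(steps):
        # Each base keeps what does not leave it and gains the inflow from the others.
        pi = {
            j: pi[j] * (1.0 - dt * out[j])
            + dt * sum(pi[i] * f.get((i, j), 0.0) for i in BASES if i != j)
            for j in BASES
        }
    return pi


f_hla_h = relative_frequencies(HLA_H_COUNTS, HLA_H_BASE_CONTENTS)
ps_hla_h = transition_proportion(f_hla_h)
pi_hla_h = equilibrium_contents(f_hla_h)
print(round(ps_hla_h, 1), {b: round(100 * v, 1) for b, v in pi_hla_h.items()})
```

Under these assumptions the sketch should give Ps close to the 57.3% of Table 1 and an expected A + T content close to the 71.2% of Table 2.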

First and Second Codon Positions of Functional MHC Genes

Table 3 summarizes the numbers of nucleotide substitutions at the first and second codon positions of functional MHC genes. In Table 3, the values of n_ij, the number of transitional substitutions (transition), the sum of n_ij (Σ n_ij), MNRS, the number of sites compared, and the average base contents [N(i)] are presented for each of the ARS and non-ARS regions. A&B represents the sum of HLA-A and HLA-B. The proportion of the sum of n_ij to MNRS was, on the average, as high as 73%. The numbers of substitutions per nucleotide site in the whole tree were 0.21 (= 130/618) in non-ARS and 1.96 (= 223/114) in ARS, respectively. Thus, the substitution rate in ARS is 9.3 times higher than that in non-ARS at the first and second codon positions.

Relative substitution frequencies, f_ij, were calculated based on the data presented in Table 3 (see Table 4). It is obvious that the substitution pattern in ARS differs from that in non-ARS. In non-ARS, the transitions A → G, C → T, and G → A and the transversions between G and C were very frequent. In ARS, on the other hand, the transition C → T and the transversion G → C were less frequent. Instead, the substitutions from T to the other nucleotides were very frequent. Ps in non-ARS was slightly higher (49%) than Ps in ARS (41%). It is mainly due to the scarcity of the transition C → T in ARS. Obviously, these patterns differed from that in pseudogenes. The discrepancy will be explained by the selective forces on functional MHC genes.

Table 3. The number of nucleotide substitutions, n_ij, and the observed base contents, N(i), at the first and second codon positions of class I MHC genes (see text for explanations)

                 Non-ARS region              ARS region
                 HLA-A   HLA-B   A&B         HLA-A   HLA-B   A&B
A → T            1       1       2           4       6       10
A → C            0       4       4           2       7       9
A → G            5       10      15          10      19      29
T → A            1       0       1           2       13      15
T → C            3       1       4           5       12      17
T → G            1       3       4           5       11      16
C → A            2       1       3           1       3       4
C → T            6       9       15          5       5       10
C → G            7       6       13          9       9       18
G → A            7       5       12          5       8       13
G → T            2       2       4           5       7       12
G → C            9       9       18          3       5       8
Transition       21      25      46          25      44      69
Σ n_ij           44      51      95          56      105     161
MNRS             55      75      130         84      139     223
Site             618     612     618         114     114     114
N(A)             154.0   145.9   150.6       36.5    39.3    38.1
N(T)             115.6   110.1   113.5       21.0    21.0    21.0
N(C)             163.0   167.2   165.4       25.4    24.9    25.1
N(G)             185.5   188.8   188.5       31.1    28.7    29.7

Table 4. Relative substitution frequencies, f_ij, the proportion of transitions, Ps, and the equilibrium base contents, π̂(i), at the first and second codon positions of class I MHC genes

                 Non-ARS region              ARS region
                 HLA-A   HLA-B   A&B         HLA-A   HLA-B   A&B
A → T            2.4     2.1     2.3         5.4     3.9     4.4
A → C            0.0     8.6     4.5         2.7     4.6     4.0
A → G            12.0    21.5    17.0        13.6    12.4    12.8
T → A            3.2     0.0     1.5         4.7     15.8    12.0
T → C            9.6     2.8     6.0         11.8    14.6    13.7
T → G            3.2     8.5     6.0         11.8    13.4    12.8
C → A            4.5     1.9     3.1         2.0     3.1     2.7
C → T            13.6    16.8    15.5        9.8     5.1     6.7
C → G            15.8    11.2    13.4        17.5    9.3     12.1
G → A            13.9    8.3     10.9        8.0     7.1     7.4
G → T            4.0     3.3     3.6         8.0     6.2     6.8
G → C            17.9    14.9    16.3        4.8     4.5     4.5
Ps               49.0    49.4    49.3        43.1    39.3    40.6
π̂(A)             32.9    8.7     16.9        21.0    25.2    24.4
π̂(T)             26.5    41.2    35.1        21.3    10.7    13.7
π̂(C)             18.9    20.9    22.2        17.2    25.4    21.7
π̂(G)             21.7    29.2    25.9        40.5    38.7    40.2

Third Codon Positions of Functional MHC Genes

At the third codon positions of functional MHC genes, the sum of n_ij occupied about 80% of MNRS in these sites (Table 5). The numbers of substitutions per nucleotide site were 2.6 times larger in ARS than in non-ARS. The comparison of substitution numbers between non-ARS and ARS indicates that the proportion of transitions is very low in ARS.

A drastic difference was observed between the substitution patterns at the third codon positions of non-ARS and ARS (Table 6). In non-ARS, every transitional substitution was very frequent. In ARS, however, all transitions were quite limited. Instead, the transversions between G and C were so frequent that they summed up to 48% of all substitutions. The difference between non-ARS and ARS was most evident in the proportion of transitions. In non-ARS, Ps was as large as 76%, but it was only 21% in ARS. In particular, Ps was only 5.7% in HLA-A. Compared with Ps in pseudogenes, Ps in non-ARS was much larger, but Ps in ARS was extremely small.

Table 5. The number of nucleotide substitutions, n_ij, and the observed base contents, N(i), at the third codon positions of class I MHC genes

                 Non-ARS region              ARS region
                 HLA-A   HLA-B   A&B         HLA-A   HLA-B   A&B
A → T            0       0       0           0       0       0
A → C            2       0       2           0       1       1
A → G            3       4       7           0       1       1
T → A            0       0       0           1       0       1
T → C            3       7       10          0       1       1
T → G            0       0       0           0       0       0
C → A            3       1       4           1       2       3
C → T            7       14      21          0       0       0
C → G            1       3       4           4       8       12
G → A            10      11      21          1       1       2
G → T            1       2       3           2       2       4
G → C            3       10      13          5       9       14
Transition       23      36      59          1       3       4
Σ n_ij           33      52      85          14      25      39
MNRS             40      64      104         19      30      49
Site             309     306     309         57      57      57
N(A)             23.1    30.6    28.6        4.8     4.8     4.8
N(T)             40.6    36.5    38.3        5.8     4.8     5.2
N(C)             116.9   118.4   117.7       22.3    24.9    23.8
N(G)             128.4   120.5   124.4       24.2    22.5    23.2

Table 6. Relative substitution frequencies, f_ij, the proportion of transitions, Ps, and the equilibrium base contents, π̂(i), at the third codon positions of class I MHC genes

                 Non-ARS region              ARS region
                 HLA-A   HLA-B   A&B         HLA-A   HLA-B   A&B
A → T            0.0     0.0     0.0         0.0     0.0     0.0
A → C            17.5    0.0     6.3         0.0     13.3    9.1
A → G            26.3    19.6    21.9        0.0     13.3    9.1
T → A            0.0     0.0     0.0         23.7    0.0     8.4
T → C            15.0    28.8    23.3        0.0     13.4    8.4
T → G            0.0     0.0     0.0         0.0     0.0     0.0
C → A            5.2     1.3     3.0         6.2     5.1     5.5
C → T            12.1    17.8    15.9        0.0     0.0     0.0
C → G            1.7     3.8     3.0         24.7    20.6    22.0
G → A            15.8    13.7    15.1        5.7     2.9     3.8
G → T            1.6     2.5     2.2         11.4    5.7     7.5
G → C            4.7     12.5    9.3         28.4    25.6    26.3
Ps               69.2    80.0    76.2        5.7     29.6    21.2
π̂(A)             10.3    13.2    12.7        -(a)    11.8    22.7
π̂(T)             34.2    28.2    30.1        -(a)    13.2    12.3
π̂(C)             40.1    43.7    42.0        -(a)    44.0    37.5
π̂(G)             15.4    14.8    15.3        -(a)    31.1    27.4

(a) Equilibrium base contents could not be calculated

Correlation between Substitution Patterns in Various Genes and Gene Regions

As seen above, each gene region had its characteristic pattern of nucleotide substitutions. To quantify the similarities of substitution patterns among different genes, we calculated the coefficients of correlation between different sets of f_ij values (Table 7). We compared the patterns in the following genes: the ARS region of HLA-A and HLA-B, the non-ARS region of HLA-A and HLA-B, HLA-H, mammalian nuclear pseudogenes, mammalian nuclear functional genes, and η-globin spacer region.

The largest value of correlation coefficients was found to be the one between the patterns in HLA-H and mammalian pseudogenes. In this case, the correlation coefficient was 0.87. Because both genes are free from any selective forces, the patterns in these genes should be close to the pattern of spontaneous substitution mutations. Thus, the high correlation is reasonable. η-globin spacer region is also thought to be free from any selective forces. As expected, the correlation coefficients among HLA-H, pseudogenes, and η-globin spacer region were very high. The pattern at the first and second codon positions of non-ARS was correlated with the patterns in many other genes except ARS. The pattern at the third codon positions of non-ARS was highly correlated with the patterns in HLA-H, pseudogenes, and η-globin spacer. However, the patterns in ARS were not correlated with those in any other genes. In particular, negative correlation was found between the pattern at the third codon positions of ARS and the patterns in pseudogenes and η-globin spacer region. This observation indicates that the patterns in ARS, both at the first and second codon positions and at the third codon positions, have been tremendously skewed by selective forces.

Table 7. Coefficients of correlation among substitution patterns(a)

                                 1       2       3       4       5       6       7
1  Non-ARS, first and second
2  ARS, first and second         0.16
3  Non-ARS, third                0.59    0.29
4  ARS, third                    0.44    0.04    0.02
5  HLA-H                         0.73   -0.14    0.59    0.07
6  Pseudogenes                   0.50   -0.13    0.56   -0.39    0.87
7  Functional genes              0.61   -0.09    0.38    0.30    0.66    0.50
8  η-globin spacer               0.53    0.18    0.85   -0.25    0.75    0.84    0.57

(a) Numbers in the first column represent the patterns of the following genes: 1) A&B, non-ARS at the first and second codon positions; 2) A&B, ARS at the first and second codon positions; 3) A&B, non-ARS at the third codon positions; 4) A&B, ARS at the third codon positions; 5) HLA-H; 6) pseudogenes; 7) functional genes at the first and second codon positions; 8) η-globin spacer

Discussion

Selective Forces Operating on Class I MHC Genes

As mentioned before, substitution patterns can be used to specify the type of selective forces operating on genes. If the pattern observed in a gene differs from the pattern of spontaneous substitution mutations, natural selection or functional constraint is thought to be operating on the gene. The selective forces on functional genes affect especially the patterns at the third codon positions due to the nature of genetic codes. Considering the genetic codes, it is obvious that most transitions at the third codon positions do not cause amino acid replacements with a few exceptions. In contrast, roughly half of transversions at the third codon positions do cause amino acid replacements. On the other hand, most substitution mutations, both transitions and transversions, at the first and second codon positions cause amino acid replacements.

These facts enable us to predict the expected patterns of nucleotide substitutions when selective forces are operating on genes. If amino acid replacements are accelerated by positive natural selection, it is expected that the number of substitutions at the first and second codon positions increases, and that the proportion of transversions increases at the third codon positions. Meanwhile, if amino acids are conserved by functional constraint (negative selection against amino acid replacements), the proportion of transitions may increase at the third codon positions because most transitions are synonymous and half of transversions are nonsynonymous.

The observed pattern in the ARS region of functional class I MHC genes clearly indicates that amino acid replacements are accelerated in this gene region. The proportion of transitions at the third codon positions was extremely small. In addition, the number of substitutions was larger in ARS than in non-ARS, particularly at the first and second codon positions. These observations agree very well with our prediction and support Hughes and Nei's (1988) idea that positive selection is operating on ARS. On the other hand, Ps at the third codon positions of non-ARS was very high compared with Ps in pseudogenes, indicating that functional constraint is imposed on non-ARS.

The substitution pattern in HLA-H was quite similar to that in pseudogenes. This fact is important in understanding the patterns in functional MHC genes. Namely, it confirms that the pattern of spontaneous mutations in the MHC gene region is not so different from that in other genes.
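The similarity between the HLA-H and pseudogene patterns can be checked against Table 7 with an ordinary Pearson correlation over the 12 f_ij values of Table 1. That the coefficients in Table 7 are Pearson coefficients is an assumption on our part, and the names `pearson`, `HLA_H`, and `PSEUDOGENES` are introduced only for this sketch.

```python
import math

# f_ij values from Table 1, in the order
# A→T, A→C, A→G, T→A, T→C, T→G, C→A, C→T, C→G, G→A, G→T, G→C.
HLA_H = [0.0, 7.9, 11.9, 4.9, 4.9, 0.0, 5.3, 23.7, 7.9, 16.8, 4.8, 12.0]
PSEUDOGENES = [4.9, 5.3, 11.0, 4.3, 6.5, 4.7, 8.2, 21.9, 4.6, 16.0, 6.9, 5.5]


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


print(round(pearson(HLA_H, PSEUDOGENES), 2))
```

With the rounded values of Table 1 this gives a coefficient that agrees with the 0.87 reported for HLA-H and pseudogenes in Table 7.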
Consequently, if we assume that the pattern of spontaneous mutations is uniform throughout the gene region, we can attribute the peculiar pattern observed in functional MHC genes not to biased mutation pressure but to natural selection and functional constraint. The substitution pattern in the ARS region of functional class I MHC genes is possibly typical of genes on which natural selection favoring amino acid replacements is operating. There may be other genes whose pattern of nucleotide substitutions is similar to the pattern in the ARS region. The putative ARS regions of class II MHC genes are candidates, because their rate of nonsynonymous substitutions is enhanced, as in class I genes (Hughes and Nei 1989).
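The premise used throughout this subsection, that third-position transitions are mostly synonymous whereas roughly half of third-position transversions cause replacements, can be checked by enumerating the standard genetic code. The Python sketch below is only an illustration: it weights all sense codons equally and excludes changes to termination codons, both of which are choices made for the example rather than part of the analysis in the text.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = termination).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
PURINES = {"A", "G"}


def third_position_effects():
    """Count third-position changes between sense codons by kind and effect."""
    counts = {
        ("transition", "synonymous"): 0,
        ("transition", "replacement"): 0,
        ("transversion", "synonymous"): 0,
        ("transversion", "replacement"): 0,
    }
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid == "*":
            continue  # termination codons are not used as starting points
        for new_base in BASES:
            if new_base == codon[2]:
                continue
            new_amino_acid = CODON_TABLE[codon[:2] + new_base]
            if new_amino_acid == "*":
                continue  # changes creating a termination codon are excluded
            same_class = (codon[2] in PURINES) == (new_base in PURINES)
            kind = "transition" if same_class else "transversion"
            effect = "synonymous" if new_amino_acid == amino_acid else "replacement"
            counts[(kind, effect)] += 1
    return counts


for key, value in third_position_effects().items():
    print(key, value)
```

Only the changes between ATA (Ile) and ATG (Met) are replacement transitions under this counting, whereas somewhat under half of the transversions change the amino acid, in line with the statement in the text.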

Errors in Estimating the Patterns In estimating the patterns of nucleotide substitutions, there are three possible sources of errors. First,

if multiple substitutions have frequently occurred on a branch of a phylogenetic tree, they lead to serious errors in counting the number of substitutions. Second, gene conversions and recombinations may distort the topology of phylogenetic trees and thus affect estimation of the patterns. Finally, the sampling error is serious if the number of substitutions that could be counted is very small. In the following, we evaluate the effect of each factor.

Multiple Substitutions

Because we used the maximum parsimony method, every nucleotide difference between a node and its descendant was attributed to a single substitution. However, multiple substitutions may have occurred on some branches of the phylogenetic trees. The lengths of most branches, which are proportional to the number of nucleotide substitutions, did not exceed 0.02 per site (Figs. 1 and 2). Even in the ARS region, the number was at most 0.1 per site. If we suppose that the number (n) of nucleotide substitutions follows the Poisson distribution with a mean of 0.1, i.e., $p(n; 0.1) = 0.1^{n} \exp(-0.1)/n!$, the relative frequencies of nucleotide sites having no, single, and multiple substitutions will be $p(0) = 0.905$, $p(1) = 0.090$, and $p(\geq 2) = 0.005$, respectively. Then, the proportion of single substitutions among all substitutions is 90% [$= p(1)/0.1$]. Thus, the majority of nucleotide substitutions are attributable to single substitutions. The smaller the mean number of substitutions, the larger the proportion of single substitutions. For example, the proportion of single substitutions will be 98% if the mean is 0.02 per site. Therefore, the effect of multiple substitutions on the estimated patterns of nucleotide substitutions may be negligible.

Gene Conversion and Recombination

Polymorphisms of class I MHC genes are thought to have been generated by multiple mechanisms: substitution mutations, gene conversions, and recombinations within coding regions (Parham et al. 1989). In particular, gene conversion between homologous alleles and recombination within a coding region seem to have occurred frequently in the evolution of MHC genes. For example, HLA-B51 seems to have arisen by insertion of a short gene segment of HLA-B8 into HLA-Bw52 (Hayashi et al. 1989), and HLA-Aw69 is supposed to be a recombinant of HLA-Aw68 and HLA-A2 (Holmes and Parham 1985). Because such events seem to have occurred multiple times, it was impossible to identify with certainty and precision the alleles that have arisen through either gene conversion or recombination. As a result, we were obliged to use all available DNA sequences. Accordingly, the pattern estimated with our procedure is the pattern of nucleotide changes due to all of the substitutions, gene conversions, and recombinations. However, the nucleotide differences introduced by gene conversion and recombination must originally have been generated by substitution mutations. Thus, the estimated pattern is expected to be close to the pattern of substitution mutations.
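
The Poisson argument in the Multiple Substitutions section can be checked numerically. The following is a minimal sketch, assuming only the Poisson model and the two mean values (0.1 and 0.02 substitutions per site) quoted in the text; the function names are illustrative and are not taken from the original analysis.

```python
import math


def poisson_pmf(n: int, mean: float) -> float:
    """Probability of exactly n substitutions at a site under a Poisson model."""
    return mean ** n * math.exp(-mean) / math.factorial(n)


def single_substitution_share(mean: float) -> float:
    """Proportion of single substitutions among all substitutions, p(1)/mean."""
    return poisson_pmf(1, mean) / mean


if __name__ == "__main__":
    # Mean values taken from the text: 0.1 (ARS upper bound) and 0.02 (most branches).
    for mean in (0.1, 0.02):
        p0 = poisson_pmf(0, mean)
        p1 = poisson_pmf(1, mean)
        p_multi = 1.0 - p0 - p1  # two or more substitutions at a site
        share = single_substitution_share(mean)
        print(f"mean={mean}: p(0)={p0:.3f} p(1)={p1:.3f} "
              f"p(>=2)={p_multi:.3f} single share={share:.0%}")
```

Because p(1)/mean reduces to exp(-mean), the share of single substitutions approaches 100% as branches become shorter, which is the point made in the text.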

Sampling Errors

When the number of substitutions we can count is very small, sampling errors seriously affect the estimated pattern. Because the number of counted substitutions is inevitably finite, the estimated pattern may deviate from the real pattern. Thus, we should take into account the standard errors of the relative substitution frequencies. Let us take as an example the pattern at the third codon positions of ARS in functional HLA genes and estimate statistically the confidence interval of Ps. We could count 39 substitutions in all at the third codon positions of ARS (Table 5). Assuming that the true value of Ps is 0.5 and that the observed frequencies follow the normal distribution, the observed value of Ps is expected to lie within the interval between 34% and 66% with a probability of 95%. Similarly, the true value of Ps in ARS is estimated to be within the interval between 8% and 34% with a probability of 95%, as the observed value of Ps was 21%. Thus, the confidence interval is narrow enough to conclude that the proportion of transitions at the third codon positions is significantly reduced.
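
The two intervals quoted above can be reproduced with the usual normal approximation to a binomial proportion. This is a minimal sketch, assuming a two-sided 95% interval with z = 1.96 and the standard error sqrt(p(1 - p)/n); the text does not state the exact formula used, so the function below is an illustration and not the authors' procedure.

```python
import math
from typing import Tuple


def normal_interval(p: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation interval for a proportion p estimated from n counts."""
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width


if __name__ == "__main__":
    n_substitutions = 39  # counted at the third codon positions of ARS (Table 5)

    # Interval expected for the observed Ps if the true value were 0.5.
    low, high = normal_interval(0.5, n_substitutions)
    print(f"true Ps = 0.50: {low:.0%} to {high:.0%}")

    # Interval for the true Ps given the observed value of 21%.
    low, high = normal_interval(0.21, n_substitutions)
    print(f"observed Ps = 0.21: {low:.0%} to {high:.0%}")
```

With only 39 counted substitutions the normal approximation is rough, so an exact binomial interval would be a reasonable cross-check.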

Substitution Patterns and Genomic Evolution

Our analysis of substitution patterns in MHC genes reveals some important features of genomic evolution. In pseudogenes, the most frequent type of substitution was the transition C → T (Table 1). One reason for this observation may be as follows. In the mammalian genome, a majority of the nucleotide C at CpG dinucleotides is methylated. Methylated C tends to deaminate and turn into T (Coulondre et al. 1978). Thus, the transition C → T frequently occurs in the mammalian genome. This is also the reason that CpG dinucleotides are rare in the mammalian genome. The transition C → T in ARS of functional MHC genes was less frequent compared with that in non-ARS (Tables 4 and 6). This is understandable given that MHC genes constitute HTF islands (Tykocinski and Max 1984; Pontarotti et al. 1988). An HTF island has a cluster of CpG dinucleotides and is usually unmethylated (Bird 1986). Thus, the scarcity of methylated C in non-ARS may have resulted in a reduction of the transitions C → T and G → A (because the transition C → T is associated with the transition G → A in the complementary strand).

Nuclear genomes of warm-blooded vertebrates are composed of isochores, in which GC content is quite homogeneous over more than a few hundred thousand bases (Bernardi et al. 1985). The class I MHC region is thought to lie in a GC-rich isochore because sequences around most class I genes are GC rich (Ikemura and Aota 1988). The HLA-H sequence itself was also GC rich (Table 2). Base contents expected at the equilibrium state in HLA-H genes, however, were highly AT rich. In pseudogenes and the η-globin spacer, expected base contents were also AT rich. This suggests that the substitution pattern may be roughly uniform irrespective of the location or base contents of the gene. Our observation implies that the origin of isochores with various GC contents may not be explained solely by the difference in substitution patterns.
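
To illustrate how equilibrium base contents follow from a substitution pattern, the sketch below treats the 12 relative substitution frequencies as one-step change probabilities of a Markov chain and iterates until the base frequencies stop changing. The pattern values are hypothetical and chosen only so that C → T and G → A dominate; they are not the values reported in the tables. Reading the relative frequencies as proportional to per-site rates is likewise an assumption of this sketch.

```python
BASES = "ATCG"

# HYPOTHETICAL relative substitution frequencies (from, to); they sum to 1.
# C -> T and G -> A are set highest to mimic a pseudogene-like pattern.
PATTERN = {
    ("A", "T"): 0.05, ("A", "C"): 0.05, ("A", "G"): 0.10,
    ("T", "A"): 0.05, ("T", "C"): 0.10, ("T", "G"): 0.05,
    ("C", "A"): 0.06, ("C", "T"): 0.21, ("C", "G"): 0.05,
    ("G", "A"): 0.20, ("G", "T"): 0.06, ("G", "C"): 0.02,
}


def equilibrium_base_content(pattern, tol=1e-12, max_iter=100_000):
    """Stationary base frequencies of the chain defined by the pattern."""
    total = sum(pattern.values())
    # Normalise so that the outgoing probabilities of each base sum to <= 1.
    step = {pair: value / total for pair, value in pattern.items()}
    freq = {base: 0.25 for base in BASES}  # start from equal base contents
    for _ in range(max_iter):
        new = {}
        for j in BASES:
            # Probability that base j is left unchanged in one step.
            stay = 1.0 - sum(step[(j, k)] for k in BASES if k != j)
            # Probability mass arriving at j from the other three bases.
            inflow = sum(freq[i] * step[(i, j)] for i in BASES if i != j)
            new[j] = freq[j] * stay + inflow
        if max(abs(new[b] - freq[b]) for b in BASES) < tol:
            return new
        freq = new
    return freq


if __name__ == "__main__":
    eq = equilibrium_base_content(PATTERN)
    for base in BASES:
        print(f"{base}: {eq[base]:.3f}")
    print(f"A+T content at equilibrium: {eq['A'] + eq['T']:.3f}")
```

With a pattern of this shape the equilibrium is AT rich even if the starting sequence is GC rich, which is the kind of contrast between observed and expected base contents described for HLA-H.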

Acknowledgments. We thank Y. Ina for help with the computation. We also thank reviewers for valuable comments.

References

Bernardi G, Olofsson B, Filipski J, Zerial M, Salinas J, Cuny G, Meunier-Rotival M, Rodier F (1985) The mosaic genome of warm-blooded vertebrates. Science 228:953-958
Bird AP (1986) CpG-rich islands and the function of DNA methylation. Nature 321:209-213
Bjorkman PJ, Saper MA, Samraoui B, Bennett WS, Strominger JL, Wiley DC (1987a) Structure of the human class I histocompatibility antigen, HLA-A2. Nature 329:506-512
Bjorkman PJ, Saper MA, Samraoui B, Bennett WS, Strominger JL, Wiley DC (1987b) The foreign antigen binding site and T cell recognition regions of class I histocompatibility antigens. Nature 329:512-518
Bodmer WF (1972) Evolutionary significance of the HL-A system. Nature 237:139-145
Choo SY, John TS, Orr HT, Hansen JA (1988) Molecular analysis of the variant alloantigen HLA-B27d (HLA-B*2703) identifies a unique single amino acid substitution. Hum Immunol 21:209-219
Cianetti L, Testa U, Scotto L, La Valle R, Simeone A, Boccoli G, Giannella G, Peschle C, Boncinelli E (1989) Three new class I HLA alleles: structure of mRNAs and alternative mechanisms of processing. Immunogenetics 29:80-91
Coulondre C, Miller JH, Farabaugh PJ, Gilbert W (1978) Molecular basis of base substitution hotspots in Escherichia coli. Nature 274:775-780
Cowan EP, Jelachich ML, Biddison WE, Coligan JE (1987) DNA sequence of HLA-A11: remarkable homology with HLA-A3 allows identification of residues involved in epitopes recognized by antibodies and T cells. Immunogenetics 25:241-250
Ennis PD, Zemmour J, Salter RD, Parham P (1990) Rapid cloning of HLA-A, B cDNA by using the polymerase chain reaction: frequency and nature of errors produced in amplification. Proc Natl Acad Sci USA 87:2833-2837
Epstein H, Kennedy LJ, Holmes N (1989) An Oriental HLA-A2 is closely related to a subset of Caucasoid HLA-A2 alleles. Immunogenetics 29:112-116
Fitch WM (1971) Toward defining the course of evolution: minimum change for a specific tree topology. Syst Zool 20:406-416
Gojobori T, Li WH, Graur D (1982) Patterns of nucleotide substitution in pseudogenes and functional genes. J Mol Evol 18:360-369
Goodman M, Koop BF, Czelusniak J, Fitch DHA, Tagle DA, Slightom JL (1989) Molecular phylogeny of the family of apes and humans. Genome 31:316-335
Hayashi H, Ennis PD, Ariga H, Salter RD, Parham P, Kano K, Takiguchi M (1989) HLA-B51 and HLA-Bw52 differ by only two amino acids which are in the helical region of the α1 domain. J Immunol 142:306-311
Hill AVS, Allsopp CEM, Kwiatkowski D, Anstey NM, Twumasi P, Rowe PA, Bennett S, Brewster D, McMichael AJ, Greenwood BM (1991) Common west African HLA antigens are associated with protection from severe malaria. Nature 352:595-600
Holmes N, Parham P (1985) Exon shuffling in vivo can generate novel HLA class I molecules. EMBO J 4:2849-2854
Holmes N, Ennis P, Wan AM, Denney DW, Parham P (1987) Multiple genetic mechanisms have contributed to the generation of the HLA-A2/A28 family of class I MHC molecules. J Immunol 139:936-941
Hughes AL, Nei M (1988) Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature 335:167-170
Hughes AL, Nei M (1989) Nucleotide substitution at major histocompatibility complex class II loci: evidence for overdominant selection. Proc Natl Acad Sci USA 86:958-962
Ikemura T, Aota S (1988) Global variation in G + C content along vertebrate genome DNA. J Mol Biol 203:1-13
Kato K, Dupont B, Yang SY (1989a) Localization of nucleotide sequence which determines Mongoloid subtype of HLA-B13. Immunogenetics 29:117-120
Kato K, Trapani JA, Allopenna J, Dupont B, Yang SY (1989b) Molecular analysis of the serologically defined HLA-Aw19 antigens: a genetically distinct family of HLA-A antigens comprising A29, A31, A32, and Aw33, but probably not A30. J Immunol 143:3371-3378
Kimura M (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16:111-120
Kimura M (1983) The neutral theory of molecular evolution. Cambridge University Press, Cambridge
Klein J (1986) Natural history of the major histocompatibility complex. John Wiley & Sons, New York
Koller BH, Geraghty D, Orr HT, Shimizu Y, DeMars R (1987) Organization of the human class I major histocompatibility complex genes. Immunol Res 6:1
Kottmann AH, Seemann HA, Guessow HD, Roos MH (1986) DNA sequence of the coding region of the HLA-B44 gene. Immunogenetics 23:396-400
Lawlor DA, Ward FE, Ennis PD, Jackson AP, Parham P (1988) HLA-A and B polymorphisms predate the divergence of humans and chimpanzees. Nature 335:268-271
Li WH, Wu CI, Luo CC (1984) Nonrandomness of point mutation as reflected in nucleotide substitutions in pseudogenes and its evolutionary implications. J Mol Evol 21:58-71
Malissen M, Malissen B, Jordan BR (1982) Exon/intron organization and complete nucleotide sequence of an HLA gene. Proc Natl Acad Sci USA 79:893-897
Mayer WE, Jonker M, Klein D, Ivanyi P, Seventer G, Klein J (1988) Nucleotide sequences of chimpanzee MHC class I alleles: evidence for trans-species mode of evolution. EMBO J 7:2765-2774
Müller CA, Engler-Blum G, Gekeler V, Steiert I, Weiss E, Schmidt H (1989) Genetic and serological heterogeneity of the supertypic HLA-B locus specificities Bw4 and Bw6. Immunogenetics 30:200-207
N'Guyen C, Sodoyer R, Trucy J, Strachan T, Jordan BR (1985) The HLA-AW24 gene: sequence, surroundings and comparison with the HLA-A2 and HLA-A3 genes. Immunogenetics 21:479-489
Ooba T, Hayashi H, Karaki S, Tanabe M, Kano K, Takiguchi M (1989) The structure of HLA-B35 suggests that it is derived from HLA-Bw58 by two genetic mechanisms. Immunogenetics 30:76-80
Parham P, Lawlor DA, Lomen CE, Ennis PD (1989) Diversity and diversification of HLA-A, B, C alleles. J Immunol 142:3937-3950
Pohla H, Kuon W, Tabaczewski P, Doerner C, Weiss EH (1989) Allelic variation in HLA-B and HLA-C sequences and the evolution of the HLA-B alleles. Immunogenetics 29:297-307
Pontarotti P, Chimini G, N'guyen C, Boretto J, Jordan BR (1988) CpG islands and HTF islands in the HLA class I region: investigation of the methylation status of class I genes leads to precise physical mapping of the HLA-B and C genes. Nucleic Acids Res 16:6767-6778
Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406-425
Seemann GHA, Rein RS, Brown CS, Ploegh HL (1986) Gene conversion-like mechanisms may generate polymorphism in human class I genes. EMBO J 5:547-552
Strachan T, Sodoyer R, Damotte M, Jordan BR (1984) Complete nucleotide sequence of a functional class I gene, HLA-A3: implications for the evolution of HLA genes. EMBO J 3:887-894
Tajima F, Nei M (1982) Biases of the estimates of DNA divergence obtained by the restriction enzyme technique. J Mol Evol 18:115-120
The WHO Nomenclature Committee for factors of the HLA system (1990) Nomenclature for factors of the HLA system, 1989. Immunogenetics 31:131-140
Trapani JA, Mizuno S, Kang SH, Yang SY, Dupont B (1989) Molecular mapping of a new public HLA class I epitope shared by all HLA-B and HLA-C antigens and defined by a monoclonal antibody. Immunogenetics 29:25-32
Tykocinski ML, Max EE (1984) CG dinucleotide clusters in MHC genes and in 5' demethylated genes. Nucleic Acids Res 12:4385-4396
Ways JP, Coppin HL, Parham P (1985) The complete primary structure of HLA-Bw58. J Biol Chem 260:11924-11933
Wright S (1969) Evolution and the genetics of populations, vol 2. The theory of gene frequencies. University of Chicago Press, Chicago, p 26
Zemmour J, Ennis PD, Parham P, Dupont B (1988) Comparison of the structure of HLA-Bw47 to HLA-B13 and its relationship to 21-hydroxylase deficiency. Immunogenetics 27:281-287
Zemmour J, Koller BH, Ennis PD, Geraghty DE, Lawlor DA, Orr HT, Parham P (1990) HLA-AR, an inactivated antigen-presenting locus related to HLA-A. J Immunol 144:3619-3629

Received June 19, 1991/Revised April 3, 1992
