J. theor. Biol. (1978) 72, 743-749

Is the “Difference Index” Useful for Assessment of Protein Relatedness?

DAVID R. WOODWARD

Biochemistry Department, University of Tasmania, Hobart 7001, Australia

(Received 6 September 1977, and in revised form 30 January 1978)

This paper assesses the usefulness of the Difference Index (Metzger, Shapiro, Mosimann & Vinton, 1968) for predicting, from amino acid compositions, whether two proteins are related or unrelated. It is concluded that, with a 5% probability of false identification, Difference Index values less than 10.0 indicate relatedness and values greater than 26.8 indicate unrelatedness. A large proportion of protein pairs have Difference Index values in the region 10.0 to 26.8 and cannot therefore be reliably identified as related or unrelated by this criterion.

1. Introduction

If two proteins have substantial similarity in their amino acid sequence, they are considered to be “homologous”, i.e. derived from a common ancestor. However, the amino acid sequences of many proteins are unknown, and for these one must use weaker criteria for assessing homology. Such criteria include similarities in tryptic peptide maps, catalytic properties, amino acid composition or molecular weight.

Assessments of homology from amino acid composition have usually been impressionistic: the compositions either “look similar” or they “look different”. A more objective assessment can be made using the Difference Index introduced by Metzger, Shapiro, Mosimann & Vinton (1968), which takes into account the differences in the content of each amino acid between two proteins.

The Difference Index (D.I.) is calculated as follows. The amino acid composition of each protein is expressed as residues per 100 residues. Values for glu, gln and glx are pooled; likewise asp, asn and asx are pooled. If protein A has $X_{iA}$ residues of amino acid i per 100 residues, protein B has $X_{iB}$, and there are n amino acids (usually 18) present in at least one of the two proteins (A and B) being compared, then

$$\text{Difference Index} = 0.5 \sum_{i=1}^{n} \lvert X_{iA} - X_{iB} \rvert$$
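The calculation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-letter residue keys, the helper names `pooled_per_100` and `difference_index`, and the toy compositions are hypothetical and do not come from the paper. The rescaling to 100 residues is included so that raw residue counts can be passed in as well.

```python
# Minimal sketch of the Difference Index; names and example data are hypothetical.

# Glu/Gln/Glx are pooled as "Glx"; Asp/Asn/Asx are pooled as "Asx".
POOLS = {"Glu": "Glx", "Gln": "Glx", "Asp": "Asx", "Asn": "Asx"}


def pooled_per_100(composition):
    """Pool the acid/amide pairs and rescale to residues per 100 residues."""
    total = float(sum(composition.values()))
    pooled = {}
    for residue, amount in composition.items():
        key = POOLS.get(residue, residue)
        pooled[key] = pooled.get(key, 0.0) + 100.0 * amount / total
    return pooled


def difference_index(comp_a, comp_b):
    """Half the sum of absolute differences over every residue present in A or B."""
    a = pooled_per_100(comp_a)
    b = pooled_per_100(comp_b)
    residues = set(a) | set(b)
    return 0.5 * sum(abs(a.get(r, 0.0) - b.get(r, 0.0)) for r in residues)


# Toy compositions (not real proteins), already in residues per 100 residues.
protein_a = {"Ala": 50, "Gly": 50}
protein_b = {"Ala": 40, "Gly": 40, "Glu": 10, "Gln": 10}
print(difference_index(protein_a, protein_b))
```

With this definition, identical compositions give 0 and compositions with no amino acid in common give 100, which bounds the index.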

0022-5193/78/0620-0743 $01.00/0

© 1978 Academic Press Inc. (London) Ltd.


Having obtained such a value, one must then interpret it. Metzger et al. (1968) provided the distribution of Difference Index values in a sample of pairs of unrelated proteins (values ranged from 9.1 to 54.0). Using these data, and the intuitive assumption that related pairs would have lower Difference Index values than unrelated pairs, several authors, e.g. Kuroda (1972) and Holroyde & Trayer (1976), have used low Difference Index values as evidence for the homology of proteins.

The value of such “evidence” is weakened by two considerations: (a) the sample of unrelated pairs used by Metzger et al. (1968) shows distinct bias (at least half the proteins used came from a single species, only 10% came from non-mammals, and some of the “unrelated” proteins were in fact homologous); (b) the distribution of Difference Index values among related pairs has not been established.

This paper investigates the distribution of Difference Index values among pairs of homologous proteins and re-investigates the distribution among pairs of unrelated proteins. Critical values are established which allow the identification of protein pairs as “homologous” or “non-homologous” with a defined low risk of false identification.

2. The Selection of Samples

Most statistical procedures depend on random sampling. While large numbers of proteins have been analysed, it is uncertain whether they constitute a representative sample of all naturally occurring proteins. Certainly, only a small and unrepresentative range of species has been used so far for protein studies. Therefore, selection of a sample of proteins for a study such as this cannot be strictly random.

Group A comprised: Rana pipiens actin (Carsten & Katz, 1964); Bos taurus κ-casein B (Mercier, Grosclaude & Ribadeau-Dumas, 1973); Cochliomyia hominivorax cytochrome c (Dayhoff & Eck, 1968); T4 Bacteriophage dihydrofolate reductase (E.C. 1.5.1.4) (Erickson & Matthews, 1973); Staphylococcus aureus enterotoxin c (Huang, Shih, Borja, Avena & Bergdoll, 1967); Desulfovibrio gigas ferredoxin (Travis, Newman, LeGall & Peck, 1971); Saccharomyces cerevisiae fructose-bisphosphate aldolase (E.C. 4.1.2.13) (Harris, Kobes, Teller & Rutter, 1969); Cyprinus carpio haemoglobin α-chain (Dayhoff & Eck, 1968); Tetrahymena pyriformis histone I (Hamana & Iwai, 1971); Gadus callarias insulin (Dayhoff & Eck, 1968); Gallus domesticus lysozyme (E.C. 3.2.1.17) (Brew & Hill, 1975); Anabaena variabilis plastocyanin (Aitken, 1975); Rattus rattus serum albumin (Peters, 1962); Canavalia ensiformis urease (E.C. 3.5.1.5) (Milton & Taylor, 1969); Squalus acanthias adrenocorticotropin (Lowry, Bennett, McMartin & Scott, 1974).

The 15 proteins in group A, so far as could be established, are reasonably homogeneous and show no significant homology with each other. They are diverse both in terms of protein function and of taxonomic categories (15 species, 15 orders; 7 are vertebrates). A total of 105 pairs (“combinations”) was generated from this group.

Group B comprised:
(1) Adrenocorticotropin from Bos taurus, Homo sapiens, Squalus acanthias and Sus scrofa, and α-melanotropin from Squalus acanthias (Lowry et al., 1974).
(2) Chymotrypsins A and B, i.e. residues 16-245 of the corresponding chymotrypsinogen (E.C. 3.4.21.1), from Bos taurus; elastase (E.C. 3.4.21.11) from Sus scrofa; and trypsin (E.C. 3.4.21.4) from Bos taurus and Squalus acanthias (Shotton & Hartley, 1970, except for Squalus trypsin from Titani, Ericsson, Neurath & Walsh, 1975).
(3) Cytochrome b5 from Alouatta fusca, Gallus domesticus, Homo sapiens, Oryctolagus cuniculus and Sus scrofa (Nobrega & Ozols, 1971).
(4) Cytochrome c from Candida krusei, Chelydra serpentina, Macropus kanguru, Rhachianectes glaucus and Triticum vulgare (Dayhoff & Eck, 1968).
(5) Ferredoxin from Colocasia esculenta, Leucaena glauca, Medicago sativa, Spinacia oleracea and Spirulina maxima (Tanaka et al., 1975).
(6) Haemoglobin α-chain from Cyprinus carpio and Mus musculus; haemoglobin β-chain from Homo sapiens and Ovis aries; and myoglobin from Physeter catodon (Dayhoff & Eck, 1968).
(7) Insulin (both A and B chains) from Balaenoptera borealis, Equus caballus, Gadus callarias, Oryctolagus cuniculus and Rattus rattus (Dayhoff & Eck, 1968).
(8) α-Lactalbumin from Bos taurus and Cavia porcellus; lysozyme (E.C. 3.2.1.17) from Anas platyrhynchos (type II), Gallus domesticus and Homo sapiens (Brew & Hill, 1975, except for Anas lysozyme from Hermann, Jollès & Jollès, 1971).
(9) Plastocyanin from Anabaena variabilis, Chlorella fusca, Cucurbita pepo, Phaseolus vulgaris and Solanum tuberosum (Aitken, 1975, except for Cucurbita from Scawen & Boulter, 1974).
(10) Ribonuclease I (E.C. 3.1.4.22) from Camelus dromedarius, Equus caballus, Giraffa camelopardalis, Rangifer tarandus and Rattus rattus (Welling, Green & Beintema, 1975).
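The pair counts for the two groups follow from simple combinatorics: every unordered pair among the 15 group A proteins, and every unordered pair within each of the ten five-member families of group B. The sketch below uses placeholder labels rather than the actual protein names; the variable names are illustrative assumptions, not part of the paper.

```python
from itertools import combinations

# Placeholder labels stand in for the 15 mutually unrelated proteins of group A.
group_a = [f"A{i}" for i in range(1, 16)]
unrelated_pairs = list(combinations(group_a, 2))

# Group B: ten families of five proteins each; pairs are formed only
# within a family, so every pair is a homologous pair.
group_b = {
    family: [f"family{family}_protein{member}" for member in range(1, 6)]
    for family in range(1, 11)
}
related_pairs = [
    pair
    for members in group_b.values()
    for pair in combinations(members, 2)
]

print(len(unrelated_pairs), len(related_pairs))
```

Pairing only within families is what keeps the related sample free of cross-family (unrelated) comparisons.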


Group B consists of 50 proteins whose amino acid sequence has been fully established. There are five proteins from each of ten protein “families”; the ten families do not show any substantial homology with each other. The proteins are diverse both in terms of protein function and of taxonomic categories (35 species, 23 orders; 38 are vertebrates). For each family, all ten pairs were generated, yielding a total of 100 homologous pairs. These pairs vary in the extent of their homology, ranging from 98% sequence identity to 20% sequence identity.

3. Difference Index Values of the Two Samples

The sample of unrelated pairs (group A) gave a mean D.I. of 26.4 with standard deviation 8.6; the median value was 28.0, and the range 7.5-45.3. As judged by the chi-square test, the hypothesis of a normal population was not rejected at the 0.05 level.

The sample of related pairs (group B) gave a mean D.I. of 13.7 with standard deviation 6.8; the median value was 13.7 and the range 0.0-29.0. As judged by the chi-square test, the hypothesis of a normal population was not rejected at the 0.05 level.

The two-sample t-test indicates that the mean of the related pairs is lower than that of the unrelated pairs, at the 0.05 level of significance, and even at the 0.001 level. (Strictly, the t-test assumes that the two populations have equal variances; an F-test rejected the hypothesis of equal variances at the 0.05 level, but not at the 0.02 level.)

4. Tolerance Limits for the Two Populations

Clearly, a high D.I. suggests non-homology, a low D.I. homology. We may refine this statement by establishing values a (which is exceeded by 95% of unrelated pairs) and b (exceeded by 5% of related pairs). Thus, unknown pairs with a D.I. less than a are probably related, and those with a D.I. greater than b are probably unrelated, the risk of false identification in each case being 5%. Values between a and b cannot be confidently assigned to either category.

Point estimates of a and b are the 5th centile of the unrelated sample and the 95th centile of the related sample. However, these are subject to a degree of sampling error which can be allowed for by use of one-sided tolerance limits (Crow, Davis & Maxfield, 1960; section 4.9). Using tolerance limits with a confidence coefficient of 0.95, we find a to be 10.0 and b 26.8.

Less stringent limits may be defined, e.g. c (exceeded by 90% of unrelated pairs) and d (exceeded by 10% of related pairs). Using tolerance limits with confidence coefficient 0.95, we find c to be 13.4 and d 24.1.
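As a rough check on these limits, a one-sided tolerance limit for a normal population can be written as mean ± k × s.d. The sketch below is an assumption-laden illustration: it uses a common closed-form approximation for the factor k instead of the tables in Crow, Davis & Maxfield (1960), and it takes the rounded sample means and standard deviations quoted in section 3 as inputs. The function name `one_sided_tolerance_factor` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_tolerance_factor(n, coverage, confidence):
    """Approximate k such that mean + k*sd (or mean - k*sd) bounds the stated
    fraction of a normal population with the stated confidence.
    Closed-form approximation (assumed here), not the tabulated values."""
    z_p = NormalDist().inv_cdf(coverage)
    z_g = NormalDist().inv_cdf(confidence)
    a = 1.0 - z_g ** 2 / (2.0 * (n - 1))
    b = z_p ** 2 - z_g ** 2 / n
    return (z_p + sqrt(z_p ** 2 - a * b)) / a


# Summary statistics from section 3 (rounded as published).
unrelated = {"n": 105, "mean": 26.4, "sd": 8.6}   # group A
related = {"n": 100, "mean": 13.7, "sd": 6.8}     # group B

limits = {}
for name, coverage in (("a_b", 0.95), ("c_d", 0.90)):
    k_unrel = one_sided_tolerance_factor(unrelated["n"], coverage, 0.95)
    k_rel = one_sided_tolerance_factor(related["n"], coverage, 0.95)
    lower = unrelated["mean"] - k_unrel * unrelated["sd"]  # exceeded by most unrelated pairs
    upper = related["mean"] + k_rel * related["sd"]        # exceeded by few related pairs
    limits[name] = (lower, upper)
    print(name, round(lower, 1), round(upper, 1))
```

Under these assumptions the computed limits land within about 0.1 of the published values of a, b, c and d; small discrepancies are expected from the rounded inputs and the approximation used for k.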


Thus with confidence coefficient 0.95 and risk of false assignment 0.05, a D.I. less than 10.0 implies homology and greater than 26.8 implies non-homology. If we relax the risk of false assignment to 0.10, a D.I. less than 13.4 implies homology and greater than 24.1 implies non-homology. In each case, intermediate values cannot be reliably assigned to either category.

5. Assessment of Validity using Fresh Samples

The reliability of these criteria depends mainly on the randomness of the original samples. While this cannot be directly assessed, an approximate check may be made by taking fresh samples of related and unrelated pairs and determining whether or not their D.I. values fit the criteria proposed.

Group C comprised: Bos taurus αs1-casein B (Mercier, Brignon & Ribadeau-Dumas, 1971); Homo sapiens blood coagulation factor IX (DiScipio, Hermodson, Yates & Davie, 1977); Staphylococcus aureus penicillinase (E.C. 3.5.2.6) (Ambler, 1975); Alaria esculenta cytochrome f (Laycock, 1975); Saccharomyces cerevisiae alcohol dehydrogenase (E.C. 1.1.1.1) (Jörnvall, 1977).

This fresh sample of unrelated pairs comprised five unrelated proteins; all 10 pairs of these were used. The five proteins came from 5 species, representing 5 orders; 3 were non-vertebrates. One pair gave a D.I. >26.8; four gave D.I. values ≥24.1; all gave values >13.4.

Group D comprised:

Haemoglobin α-chain from Oryctolagus cuniculus and Equus caballus (Dayhoff & Eck, 1968); insulin, A and B chains, from Homo sapiens and Gallus domesticus (Dayhoff & Eck, 1968); cytochrome c from Triticum vulgare and Cochliomyia hominivorax (Dayhoff & Eck, 1968); Bence Jones protein λ-chain from Homo sapiens, SH type and HA type (Dayhoff & Eck, 1968); fibrinopeptide A from Canis familiaris and Meles meles (Dayhoff & Eck, 1968); (bacterial) ferredoxins from Clostridium pasteurianum and Cl. butyricum (Dayhoff & Eck, 1968); κ-casein from Ovis aries (KA) (Jollès, Fiat, Schoentgen, Alais & Jollès, 1974) and Bos taurus (KB) (Mercier et al., 1973); cytochrome f from Euglena gracilis and Alaria esculenta (Laycock, 1975); triosephosphate isomerase (E.C. 5.3.1.1) from Oryctolagus cuniculus and Latimeria chalumnae (Corran & Waley, 1975); alcohol dehydrogenase (E.C. 1.1.1.1) from Equus caballus and Saccharomyces cerevisiae (Jörnvall, 1977).

This fresh sample of related pairs comprised one pair from each of 10 unrelated protein families. They came from 16 species, representing 13 orders; 7 of the species were non-vertebrates. The extent of homology ranged from around 85% sequence identity to around 25%. Four pairs gave D.I. values less than 10.0; eight gave values less than 13.4; all gave values less than 24.1.

Thus, whether we apply the criteria corresponding to the 0.05 or the 0.10 false assignment risk, none of the pairs in the two fresh samples yielded a false assignment. We would regard the fresh samples as inconsistent with the earlier samples if the probability of zero false assignments in a sample of 10 pairs was less than 0.05. In fact, the binomial distribution predicts that the probability of zero false assignments in a sample of 10 pairs should be 0.60 (for a false assignment risk of 0.05) or 0.35 (for a false assignment risk of 0.10). Clearly, then, our fresh samples do not invalidate the criteria established from our earlier samples.

6. Conclusion

The results presented in this paper indicate that the D.I. can be used reliably as a screening test for detecting protein interrelationships. If one is prepared to allow one error per 20 predictions, a D.I. less than 10.0 provides evidence of homology that should not be dismissed without good evidence; a D.I. greater than 26.8 similarly provides evidence for non-homology. If one allows one error per 10 predictions, the critical values become: <13.4 for homology; >24.1 for non-homology. Clearly, intermediate values do not allow prediction of either homology or non-homology; estimates based on the samples are that the range 10.0-26.8 includes approximately 60% of related pairs and 40% of unrelated pairs, while the range 13.4-24.1 includes approximately 40% of both types.

Note: Amino acid composition data were obtained from primary sources. To avoid an over-long list of references, I have (where possible) referenced data to Dayhoff & Eck (1968) or to the most recent paper of a series; original references can be easily identified from these sources.
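The decision rule of the Conclusion, and the binomial check used in section 5, can be sketched as follows. The function names, the returned labels, and the default thresholds as keyword arguments are illustrative choices rather than anything prescribed by the paper; the critical values themselves are the ones stated above.

```python
def classify_pair(di, homology_below=10.0, non_homology_above=26.8):
    """Three-way screening rule: the defaults correspond to a 0.05 risk of
    false assignment; pass 13.4 and 24.1 for the 0.10 risk."""
    if di < homology_below:
        return "homologous"
    if di > non_homology_above:
        return "non-homologous"
    return "unassigned"


def prob_zero_false_assignments(n_pairs, risk):
    """Binomial probability of no false assignment among n_pairs independent
    pairs, each carrying the given risk of false assignment."""
    return (1.0 - risk) ** n_pairs


# Hypothetical D.I. values, for illustration only.
for di in (5.0, 12.0, 20.0, 30.0):
    print(di, classify_pair(di), classify_pair(di, 13.4, 24.1))

print(round(prob_zero_false_assignments(10, 0.05), 2))
print(round(prob_zero_false_assignments(10, 0.10), 2))
```

Returning an explicit "unassigned" category keeps the intermediate range visible instead of forcing every pair into one of the two classes.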

REFERENCES

AITKEN, A. (1975). Biochem. J. 149, 675.
AMBLER, R. P. (1975). Biochem. J. 151, 197.
BREW, K. & HILL, R. L. (1975). Rev. Physiol. Biochem. Pharmacol. 72, 105.
CARSTEN, M. E. & KATZ, A. M. (1964). Biochim. biophys. Acta 90, 534.
CORRAN, P. H. & WALEY, S. G. (1975). Biochem. J. 145, 335.
CROW, E. L., DAVIS, F. A. & MAXFIELD, M. W. (1960). Statistics Manual. New York: Dover Publications.
DAYHOFF, M. O. & ECK, R. V. (1968). Atlas of Protein Sequence and Structure 1967-1968. Silver Springs: National Biomedical Research Foundation.
DISCIPIO, R. G., HERMODSON, M. A., YATES, S. G. & DAVIE, E. W. (1977). Biochemistry 16, 698.
ERICKSON, J. S. & MATTHEWS, C. K. (1973). Biochem. 12, 372.
HAMANA, K. & IWAI, K. (1971). J. Biochem. 69, 1097.
HARRIS, C. E., KOBES, R. D., TELLER, D. C. & RUTTER, W. J. (1969). Biochem. 8, 2442.
HERMANN, J., JOLLÈS, J. & JOLLÈS, P. (1971). Eur. J. Biochem. 24, 12.
HOLROYDE, M. J. & TRAYER, I. P. (1976). FEBS Lett. 62, 215.
HUANG, I., SHIH, T., BORJA, C. R., AVENA, R. M. & BERGDOLL, M. S. (1967). Biochem. 6, 1480.
JOLLÈS, J., FIAT, A. M., SCHOENTGEN, F., ALAIS, C. & JOLLÈS, P. (1974). Biochim. biophys. Acta 365, 335.
JÖRNVALL, H. (1977). Eur. J. Biochem. 72, 425.
KURODA, H. (1972). Biochim. biophys. Acta 285, 253.
LAYCOCK, M. V. (1975). Biochem. J. 149, 271.
LOWRY, P. J., BENNETT, H. P. J., MCMARTIN, C. & SCOTT, A. P. (1974). Biochem. J. 141, 427.
MERCIER, J. C., GROSCLAUDE, F. & RIBADEAU-DUMAS, B. (1971). Eur. J. Biochem. 23, 44.
MERCIER, J. C., BRIGNON, G. & RIBADEAU-DUMAS, B. (1973). Eur. J. Biochem. 35, 222.
METZGER, H., SHAPIRO, M. P., MOSIMANN, J. E. & VINTON, J. E. (1968). Nature, Lond. 219, 1166.
MILTON, J. M. & TAYLOR, I. E. P. (1969). Biochem. J. 113, 678.
NOBREGA, F. G. & OZOLS, J. (1971). J. biol. Chem. 246, 1706.
PETERS, T. (1962). J. biol. Chem. 237, 2182.
SCAWEN, M. D. & BOULTER, D. (1974). Biochem. J. 143, 257.
SHOTTON, D. M. & HARTLEY, B. S. (1970). Nature, Lond. 225, 802.
TANAKA, M., HANIU, M., ZEITLIN, S., YASUNOBU, K. T., EVANS, M. S. W., TAO, K. N. & HALL, D. O. (1975). Biochem. Biophys. Res. Comm. 64, 399.
TITANI, K., ERICSSON, L. H., NEURATH, H. & WALSH, K. A. (1975). Biochem. 14, 1358.
TRAVIS, J., NEWMAN, D. J., LEGALL, J. & PECK, H. D. (1971). Biochem. Biophys. Res. Comm. 45, 452.
WELLING, G. W., GREEN, G. & BEINTEMA, J. J. (1975). Biochem. J. 147, 505.
