Nucleic Acids Research, Vol. 20, No. 1 131-136

A simple method for global sequence comparison

Elisabetta Pizzi*, Marcella Attimonelli, Sabino Liuni, Clara Frontali¹ and Cecilia Saccone

Centro Studi Mitocondri e Metabolismo Energetico CNR, Dipartimento di Biochimica e Biologia Molecolare, University of Bari, 70126 Bari and ¹Laboratorio di Biologia Cellulare, Istituto Superiore di Sanità, 00161 Rome, Italy

Received April 29, 1991; Revised and Accepted December 4, 1991

ABSTRACT A simple method of sequence comparison, based on a correlation analysis of oligonucleotide frequency distributions, is here shown to be a reliable test of overall sequence similarity. The method does not involve sequence alignment procedures and permits the rapid screening of large amounts of sequence data. It identifies those sequences which deserve more careful analysis of sequence similarity at the level of resolution of the single nucleotide. It uses observed quantities only and does not involve the adoption of any theoretical model.

INTRODUCTION The continually increasing amount of sequence data generated by recent genome sequencing projects necessitates high-speed methods for the characterization and analysis of new sequences. When a new sequence is generated, a comparative analysis against a nucleic acid (or protein) sequence database is generally performed as a first step in the attempt to characterize its function. Several methods are available which perform this analysis by detecting similarity through best-alignment procedures. As is now generally accepted (1-3), when dealing with sequence comparisons, the term similarity is not equivalent to homology, a qualitative concept which implies a common evolutionary origin.

Several widely used computer programs for similarity search (4-8) are based on variations of the scoring system introduced by Smith and Waterman (9). In order to detect biologically important similarities, these methods emphasize the relevance of those segments of the compared sequences which maximize the similarity score. They usually require large computers in order to perform alignment-based screening of large databases in a reasonable time. We investigated the possibility of using a global approach to efficiently pre-screen a large database, in order to rapidly identify those sequences which are related to a given one. We aimed at a user-friendly program, applicable to any type of sequence, that would extract from databases subsets of sequences deserving detailed analysis by more precise and meaningful alignment procedures.

Global approaches which adopt 'macroscopic' parameters and reflect average sequence properties have been proposed by Blaisdell (10) and by Pietrokovsky, Hirshon and Trifonov (11). Both methods analyse the differences between observed and predicted global sequence properties, the predictions being based on a Markov chain model. These methods have the advantage of reducing the relevance of effects related to differential composition in mono-, di- or other short oligonucleotides. It is, however, rather doubtful whether a Markovian model of defined order can represent any type of sequence. In order to eliminate the dependence on an arbitrary model, we exploited a suggestion present but not developed in the work of Brendel, Beckmann and Trifonov (12). The idea is to perform a comparative analysis using a test of overall similarity in oligonucleotide composition. This is most quickly done by calculating a correlation coefficient between frequency distributions of nucleotide strings of a given length (oligomers). Through a series of suitable tests it is shown here that this quick and simple macroscopic approach closely reflects the results of an alignment-based similarity search, and that the signal/noise ratio can be efficiently improved by an appropriate choice of oligomer length. The correlation method (CORRELA) yields reliable results in the rapid location of a subsequence related to a given fragment, and in the identification of clusters of phylogenetically related genes. The relation of the global parameter to overall base composition and third codon position choices is discussed. A comparison with a current method (FASTA, 13) appears to confirm the efficacy of the CORRELA program.

Description of the algorithm Let X and Y be the two variables whose correlation has to be tested over a range of values [X] = X1, X2, ..., Xk, ..., XN and [Y] = Y1, Y2, ..., Yk, ..., YN. The correlation index is defined as

$$ r = \frac{\sum_k (X_k - X_m)(Y_k - Y_m)}{\left[\sum_k (X_k - X_m)^2 \, \sum_k (Y_k - Y_m)^2\right]^{1/2}} \qquad (1) $$

Xm and Ym being the average values of the two variables over the given range of values.

* To whom correspondence should be addressed at Laboratory of Cell Biology, Istituto Superiore di Sanità, Rome, Italy

In our case, Xk and Yk are the occurrence counts for the kth oligomer in sequences x and y respectively. If n is the length of the oligomer (expressed in bp), the number of different oligomers is N = 4^n and

$$ X_m = \frac{\sum_k X_k}{N} = \frac{L_x - n + 1}{4^n}, \qquad Y_m = \frac{\sum_k Y_k}{N} = \frac{L_y - n + 1}{4^n} $$

where Lx and Ly are the lengths of sequences x and y (expressed in bp) and L - n + 1 is the total number of oligomers of length n which can be accommodated in a sequence of length L, and is equal to the sum of the observed occurrences. r is a dimensionless parameter, ranging from -1 to +1. However, we are interested essentially in the positive range (0 < r < 1), a negative correlation being an unrealistic case. Note that the definition of r given by relation (1) would not change if frequencies were used in place of occurrences. At this point the following observations can be made:

1) If two sequences have a high similarity, their correlation coefficient is necessarily near to 1, except in the very unlikely case in which all the Xk are near to Xm and all the Yk are near to Ym, i.e. both sequences have a flat oligonucleotide distribution.

2) A high correlation coefficient can also be found in the case in which the two sequences share little sequence similarity but have similar frequency distributions of oligomers. The probability of such an event obviously decreases when the n value is increased. However, the higher n is, the longer the computational time. In practice, it is convenient to use a value of n ranging from 3 to 5.

3) By definition, r does not depend on the length of the compared sequences. Nevertheless, care must be exercised when comparing sequences of different length, as detailed later.

We shall assume in the course of this work that a significant correlation corresponds to r > 0.5. The validity of this assumption is substantiated by the tests described in the following sections. In conclusion, r([X],[Y]) is a global index which expresses a correlation in the deviations from the average oligomer frequencies in two sequences. It obviously does not coincide with similarity indices based on the local comparison of aligned sequences, but rather reflects a similarity in the oligomer frequency distribution. We will therefore use the terms 'overall similarity' and 'microscopic similarity' to distinguish between the two concepts. The correlation coefficient defined in relation (1) can be used to give a numerical indication of overall similarity, approaching unity when the two sequences have a high degree of microscopic similarity. Compared with an index expressing microscopic sequence similarity, it can give false positives (see point 2), but the probability of false negatives is low (see point 1).
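The definition of r in relation (1) can be illustrated with a minimal Python sketch. This is not the CORRELA program itself: the function names (`oligomer_counts`, `correlation`, `r_index`) are hypothetical, overlapping oligomers are counted at every position, windows containing symbols other than A, C, G, T are skipped, and r is returned as 0 when one distribution is perfectly flat. The last two choices are assumptions made for the example only.

```python
from itertools import product
from math import sqrt


def oligomer_counts(seq, n):
    """Return the 4**n occurrence counts of overlapping oligomers of length n."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=n))}
    counts = [0] * (4 ** n)
    seq = seq.upper()
    for i in range(len(seq) - n + 1):
        k = index.get(seq[i:i + n])
        if k is not None:  # assumption: skip windows with ambiguous symbols
            counts[k] += 1
    return counts


def correlation(x, y):
    """Correlation index r of relation (1) for two equally long count vectors."""
    size = len(x)
    xm = sum(x) / size
    ym = sum(y) / size
    sxy = sum((a - xm) * (b - ym) for a, b in zip(x, y))
    sxx = sum((a - xm) ** 2 for a in x)
    syy = sum((b - ym) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a flat distribution leaves r undefined
    return sxy / sqrt(sxx * syy)


def r_index(seq_x, seq_y, n=4):
    """Overall similarity of two sequences from their oligomer distributions."""
    return correlation(oligomer_counts(seq_x, n), oligomer_counts(seq_y, n))
```

Because r is invariant under rescaling of either vector, passing frequencies instead of occurrence counts to `correlation` gives the same value, in line with the remark following relation (1).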

MATERIALS AND METHODS The correlation analyses were performed using the EMBL data library (updated daily) resident at the Italian EMBnet node in the ACNUC (14) format. The CORRELA program was developed on a DEC VAX3900 computer under the VMS operating system. The tetramer distribution vocabulary of all sequences contained in the EMBL data library (release 27) was produced. The CORRELA program is available along with the tetramer vocabulary and the program for its updating. Similarity searches using the FASTA package were performed through the EMBL fileserver.

RESULTS Internal correlation tests As a first test of the usefulness and applicability of the correlation method, we calculated the correlation coefficient between a sequence of length L2 and a set of its sub-sequences. From the tandem reiteration of the oligomer ACGT, a 1000 bp sequence was generated and compared (choosing n=4) with subsequences of variable length L1 (100, 200, 300, ..., 900 bp). This artificial example was chosen so as to start from the simple case in which the oligomer frequency distribution is identical in the compared sequences. The results (plotted in Fig. 1 as a function of L1/L2) give a horizontal line which graphically confirms statement 3, i.e. the fact that r does not depend on the length of the compared sequences, as long as the frequency distribution does not change. However, when the same procedure is used with real sequences (the examples in Fig. 1 refer to two Adenovirus type 2 sequences) the result is different, and depends on the degree of internal correlation of the longer sequence. In effect the shape of the oligomer frequency distribution varies along the mother sequence, and therefore between the latter and its sub-sequences. The two real examples show that in order to perform a rigorous comparison using the correlation test on two sequences of unequal length, one should proceed with a window analysis, scanning the longer sequence with a window whose width corresponds to the length L1 of the shorter sequence. An indicative comparison might be performed directly only if L1 and L2 are of the same order of magnitude (e.g. L1/L2 > 0.4 in the case illustrated in Fig. 1, which refers to the choice n=4).
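The artificial test described above can be reproduced with a short sketch. The helper `r_index` is a hypothetical name, not part of CORRELA; it assumes sequences containing only A, C, G and T and counts overlapping tetramers. For the ACGT tandem repeat the values stay very close to 1 for every prefix length, with small deviations caused only by edge effects in the counts.

```python
from collections import Counter
from math import sqrt


def r_index(seq_x, seq_y, n=4):
    """Correlation of overlapping n-mer counts (only A, C, G, T assumed)."""
    cx = Counter(seq_x[i:i + n] for i in range(len(seq_x) - n + 1))
    cy = Counter(seq_y[i:i + n] for i in range(len(seq_y) - n + 1))
    total = 4 ** n
    xm = sum(cx.values()) / total
    ym = sum(cy.values()) / total
    seen = set(cx) | set(cy)
    unseen = total - len(seen)  # oligomers absent from both sequences
    sxy = sum((cx[k] - xm) * (cy[k] - ym) for k in seen) + unseen * xm * ym
    sxx = sum((cx[k] - xm) ** 2 for k in seen) + unseen * xm * xm
    syy = sum((cy[k] - ym) ** 2 for k in seen) + unseen * ym * ym
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


# Mother sequence of L2 = 1000 bp built from tandem copies of ACGT.
mother = "ACGT" * 250

# Compare the mother sequence with its first L1 nucleotides, L1 = 100..900.
for l1 in range(100, 1000, 100):
    ratio = l1 / len(mother)
    print(f"L1/L2 = {ratio:.1f}  r = {r_index(mother[:l1], mother):.4f}")
```

Replacing `mother` with a real sequence would show the length dependence discussed in the text, since the tetramer distribution of a prefix then differs from that of the whole sequence.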

Overall versus microscopic similarity How does the overall similarity, expressed by the correlation coefficient of oligonucleotide distribution, compare with the microscopic similarity, expressed by the percentage of perfect matches in the aligned sequences? To answer this question both indices have been calculated in a number of pairwise comparisons

Fig. 1. Results of correlation analysis between a sequence of length L2 and a set of its subsequences of length L1. The correlation index r, calculated for n=4 (see text), is plotted as a function of L1/L2 for: (x) an artificial mother sequence (L2 = 1000 bp), constructed by the tandem reiteration of the oligomer ACGT; (+) the initial and (@) the terminal 3000 bp of the linear genome of Adenovirus type 2. In each case the subsequences chosen for the comparison corresponded to the first L1 nucleotides of the mother sequence.

of test sequences corresponding to a defined mitochondrial DNA fragment (896 bp in length) from different primate species. The results are given in Fig. 2. For each pair of compared sequences we calculated both h (number of matches/total length; best-alignment procedures are not necessary in this case) and r (correlation index using n=4). It can be seen that the two parameters follow similar variations and are also numerically similar.

Correlation analysis on the sequences of a gene functionally conserved across species As a direct test of the ability of the correlation method to reveal sequence homology, we analysed a set of genes from widely different species, all coding for the same enzyme, glutamine synthetase (GS). This choice was based on the observation (15) that this gene is one of the few which behave as a perfect 'molecular clock', the base substitution rate in second codon position being apparently constant, so that the calculated divergences reflect estimated evolutionary time periods. The GS genes we compared are those listed in Fig. 3, which presents in the form of a square matrix the results of the comparisons performed on all possible pairs. The upper diagonal half of the matrix contains the values of the correlation coefficient r, and the lower half the microscopic similarity index h, estimated according to Wilbur and Lipman (4). Since the lengths of the examined sequences are of the same order of magnitude, a direct calculation of the correlation index was performed (using n=4), without resorting to window analysis. Only values higher than or equal to 0.6 are indicated for r, while for h only values ≥ 0.7 are reported. The clusters formed by the mammalian and plant GS genes are thus clearly recognizable in the matrix, along with a pair of highly related sequences, i.e. the GS genes from Escherichia coli and Salmonella typhimurium, while, e.g., the Anabaena GS gene appears to be scarcely related to any of the others.
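The two-index matrix layout used for these comparisons can be sketched as follows. The sketch only covers the simple case of equal-length fragments that need no alignment (as for the 896 bp mitochondrial set), where h is the fraction of identical positions; for sequences of unequal length, such as the GS genes, h requires an alignment step (Wilbur and Lipman) that is not implemented here. All function names are hypothetical, and the cut-offs of 0.6 for r and 0.7 for h are taken from the text.

```python
from collections import Counter
from math import sqrt


def r_index(seq_x, seq_y, n=4):
    """Correlation of overlapping n-mer counts (only A, C, G, T assumed)."""
    cx = Counter(seq_x[i:i + n] for i in range(len(seq_x) - n + 1))
    cy = Counter(seq_y[i:i + n] for i in range(len(seq_y) - n + 1))
    total = 4 ** n
    xm = sum(cx.values()) / total
    ym = sum(cy.values()) / total
    seen = set(cx) | set(cy)
    unseen = total - len(seen)  # oligomers absent from both sequences
    sxy = sum((cx[k] - xm) * (cy[k] - ym) for k in seen) + unseen * xm * ym
    sxx = sum((cx[k] - xm) ** 2 for k in seen) + unseen * xm * xm
    syy = sum((cy[k] - ym) ** 2 for k in seen) + unseen * ym * ym
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def match_fraction(a, b):
    """h = number of matches / total length, for equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length (no alignment step)")
    return sum(1 for p, q in zip(a, b) if p == q) / len(a)


def comparison_matrix(seqs, r_min=0.6, h_min=0.7):
    """Upper half: r values >= r_min; lower half: h values >= h_min."""
    size = len(seqs)
    matrix = [[None] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            r = r_index(seqs[i], seqs[j])
            h = match_fraction(seqs[i], seqs[j])
            matrix[i][j] = round(r, 1) if r >= r_min else None
            matrix[j][i] = round(h, 1) if h >= h_min else None
    return matrix
```

Entries below the cut-offs are left as `None`, which mirrors the blank cells of the published matrices; the diagonal is also left empty.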

From this analysis we conclude that the correlation analysis identifies the correct relationships between phylogenetically related sequences. The method is less suited to the detection of local similarities between conserved portions of phylogenetically distant sequences, although window correlation analysis might be utilised (see next section).

Location of a subsequence related to a given sequence In order to test whether correlation analysis can locate, within a sequence, those regions which are related to a given subsequence, we chose to work on a small, complete genome, that of coliphage lambda. We first tested the ability of the CORRELA program to find the exact position of a lambda subsequence (LAMBET, 3400 bp in length). The complete lambda genome was scanned using window correlation analysis, by moving a window of width L1 = 3400 bp in steps of L1/4 and by calculating for each step the correlation coefficient with three different choices of n (n=3,4,5). The results are reported in Fig. 4a. For all three choices of n, a well-defined peak, near to unity at its maximum value, is found at the expected position (indicated by a bar in Fig. 4a). The signal/noise ratio progressively improves, as expected, as n increases: secondary peaks, corresponding to regions whose oligonucleotide frequency distribution is similar to that of the tested subsequence, are progressively reduced, while the peak corresponding to the exact location is unaffected by the increase in n. Since the oligonucleotide frequency distribution in a given sequence also obviously depends on its base composition, we plotted on the same graph the G+C content of each window (dotted line in Fig. 4a). Comparison of the superimposed plots clearly shows that the G+C content alone is not the main factor determining good correlation. We then searched for the immunity region in the lambda genome, by performing the same kind of analysis against the immunity region LAIMM434 (not recognised by the lambda repressor) of the related phage lambda-imm434. This sequence (1287 bp) is correctly located (Fig. 4b) by a correlation peak which, as expected, does not reach unity.
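The window correlation analysis described in this section can be sketched as below, under stated assumptions: the function names are hypothetical, the window width equals the query length L1 and the step is L1/4 as in the text, and a seeded random sequence stands in for the lambda genome, which is not reproduced here.

```python
import random
from collections import Counter
from math import sqrt


def r_index(seq_x, seq_y, n=4):
    """Correlation of overlapping n-mer counts (only A, C, G, T assumed)."""
    cx = Counter(seq_x[i:i + n] for i in range(len(seq_x) - n + 1))
    cy = Counter(seq_y[i:i + n] for i in range(len(seq_y) - n + 1))
    total = 4 ** n
    xm = sum(cx.values()) / total
    ym = sum(cy.values()) / total
    seen = set(cx) | set(cy)
    unseen = total - len(seen)  # oligomers absent from both sequences
    sxy = sum((cx[k] - xm) * (cy[k] - ym) for k in seen) + unseen * xm * ym
    sxx = sum((cx[k] - xm) ** 2 for k in seen) + unseen * xm * xm
    syy = sum((cy[k] - ym) ** 2 for k in seen) + unseen * ym * ym
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def window_scan(genome, query, n=4):
    """Slide a window of width len(query) in steps of len(query)/4."""
    width = len(query)
    step = max(1, width // 4)
    return [(start, r_index(genome[start:start + width], query, n))
            for start in range(0, len(genome) - width + 1, step)]


# Toy stand-in for a genome: a seeded random 20000 bp sequence.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(20000))
query = genome[6000:6800]  # subsequence whose position is to be recovered

profile = window_scan(genome, query)
best_start, best_r = max(profile, key=lambda item: item[1])
print(best_start, round(best_r, 3))  # the peak falls on the window starting at 6000
```

Because the query start is a multiple of the step, one window coincides exactly with the query and r reaches unity there. When the query is only related to the scanned region, as for the immunity region of lambda-imm434, the peak is lower.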

Fig. 2. Comparison between the correlation index, r (calculated for n=4), and the similarity index, h (percentage of perfect matches), for all possible pairs formed by the following sequences (accession numbers from EMBL release 23 are indicated first, followed by entry name, species and symbol used in the figure): V00658, MIGG45, Gorilla, GG; V00659, MIHL45, Gibbon, HL; V00672, MIPP45, Chimpanzee, PP; V00675, MIPY45, Orang-utan, PY. These sequences, 896 bp in length, correspond to the Hind III restriction fragment of the mitochondrial DNA which contains the genes for transfer RNAs specific for histidine, serine and leucine, and part of the reading frames for subunits 4 and 5 of NADH dehydrogenase.

[Fig. 3 matrix: pairwise r and h values for the GS genes, numbered 1 to 16, of Rattus norvegicus, Chinese hamster, Homo sapiens, Drosophila melanogaster, Medicago sativa, Pisum sativum (nd, rt, cp), Phaseolus vulgaris (nd, rt, cp), Anabaena 7120, Escherichia coli, Salmonella typhimurium, Thiobacillus ferrooxidans and Bradyrhizobium japonicum.]

Fig. 3. The nuclear glutamine synthetase (GS) genes from various species and, in the case of plant GS genes, differentially expressed in nodules (nd), roots (rt) and leaves (cp), were paired in all possible ways. For each pair, the correlation index r (n=4) and the similarity index h (calculated according to the Wilbur and Lipman algorithm with the following parameters: k-tuple size = 3; gap penalty = 3; largest gap = 20) are reported respectively in the upper right and lower left halves of the matrix. Only r values higher than 0.6 and h values higher than 0.7 are reported. The length (in bp) of each sequence is indicated in brackets. Accession numbers, from EMBL release 23, for deposited sequences are: 1-M29579, 2-X03495, 3-Y00387, 5-X03931, 6-X05515, 7-X04001, 8-X04763, 9-X04002, 10-X05514, 11-X12738, 12-X00147, 13-M13746, 14-M14536, 15-M16626, 16-X04187. Sequence 4 was taken from ref. 22.

Fig. 5. Correlation analysis on a set of codogenic sequences belonging to the genome of Bos taurus. The compared sequences, indicated by their name in EMBL release 23, are numbered in order of decreasing G+C content (indicated after the sequence name). The upper diagonal half of the matrix gives the r values calculated with n=4 for each pair of genes. In the lower half of the matrix, the values of the ratio, t, between average G+C content in third codon position in the two compared sequences are reported. r values lower than 0.5 and t values deviating from unity by more than 10% have been suppressed, as has the principal diagonal axis. For simplicity in presentation, t values larger than unity have been inverted.

Fig. 4. Search of regions related to a given subsequence along the lambda genome. The 48502 bp of the complete lambda genome were scanned by window correlation analysis a) with the subsequence LAMBET (EMBL release 23, accession number V00638) using various oligomer lengths n (n=3,4,5); b) with the immunity region LAIMM434 (EMBL release 23, accession number J02460) of phage lambda-imm434, using n=4. In both cases a window whose width corresponded to the length, L1, of the tested subsequence (L1 = 3400 bp and L1 = 1287 bp respectively) was shifted in steps of L1/4. The expected position along the genome is indicated by a horizontal bar. The percent G+C content of each window along the lambda genome is also reported in Fig. 4a (dotted line).

Correlation index and base composition A deeper investigation of the relationship between correlation index and base composition was performed on a set of coding sequences belonging to the same genome, that of Bos taurus. For each pair of sequences, two indices were calculated and are reported as the two diagonal halves of the same matrix in Fig. 5. The first index, reported in the upper, right-hand part of the matrix, is the correlation coefficient r (calculated for n=4). Only values higher than or equal to 0.5 are marked. The second index, t, is defined as the ratio between the G+C content in the third codon positions of the two compared genes. Only values deviating from unity by 10% or less are reported. Genes are ordered in the matrix according to their overall G+C content. Several observations stem from the examination of this matrix. Firstly, some clustering appears. This is particularly evident in the mitochondrial genes and, less clearly, in the G+C rich genes (% G+C in the range 55-65%). Secondly, the fact that the two halves of the matrix are roughly symmetrical with respect to the diagonal axis (left blank in the figure for the sake of clarity) suggests that in the case of coding sequences the correlation coefficient reflects, to a certain extent, the similarity in third position choice. However, a close inspection reveals that there are several cases in which t values are high and yet no significant correlation is found. Conversely, there are cases in which a significant correlation coefficient is found between pairs of genes which do not follow the rule that third position choices reflect the overall G+C content (examples of both cases can be observed by comparing column 1 and row 1). Therefore the correlation analysis yields additional information. A particularly interesting case is that of the mitochondrial gene ND6 (no. 36 in the matrix), which is the only gene located on the heavy strand of the mitochondrial genome. The asymmetry in G/C distribution between the two strands obviously does not affect the t value, but does strongly reduce the overall similarity with other mitochondrial sequences.
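The index t discussed above can be sketched in a few lines. The function names are hypothetical, and the sketch assumes each coding sequence is given in frame (first base = first codon position), in upper case, and without ambiguous symbols. As in Fig. 5, ratios larger than unity are inverted so that t never exceeds 1; returning 1 when neither gene has G or C in third positions is an assumption of the example.

```python
def gc3(cds):
    """Fraction of G+C among third codon positions of an in-frame coding sequence."""
    third = cds[2::3]
    if not third:
        raise ValueError("sequence shorter than one codon")
    return sum(1 for base in third if base in "GC") / len(third)


def t_ratio(cds_a, cds_b):
    """Ratio of third-position G+C contents, inverted when larger than unity."""
    low, high = sorted((gc3(cds_a), gc3(cds_b)))
    if high == 0:
        return 1.0  # assumption: two genes with no G/C in third positions agree
    return low / high


def is_reported(t, tolerance=0.10):
    """True when t deviates from unity by 10% or less, as in Fig. 5."""
    return t >= 1.0 - tolerance
```

For example, `t_ratio("ATGGCCTAA", "ATGGCATAA")` compares third positions G, C, A (G+C = 2/3) with G, A, A (G+C = 1/3) and yields 0.5, which would be left blank in the matrix.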

Comparison of CORRELA to FASTA in database screening In order to test the efficacy of the CORRELA program in screening a large database, a sample of recently sequenced genes (not yet present in EMBL release 27 of 31.03.1991) was chosen as test sequences and compared against the whole of EMBL release 27, using both the CORRELA and FASTA programs for similarity searches. The three chosen sequences are: the protein kinase regulator from Homo sapiens (entry name: HS3B5H5E, accession number: X55997, length: 1528 bp); the ribosomal RNA small subunit from Aspergillus fumigatus (AFDA, M55626, 1798 bp); the hydrogenase transcriptional activator HOXA gene from Alcaligenes eutrophus (AEHOXA, M64593, 1811 bp). The results are given in Table I. The 90 sequences extracted by FASTA as those giving the best similarity score to each of the test sequences were listed in order of decreasing 'opt' score value.

Table I. Results of similarity searches on the EMBL database (release 27) performed by FASTA and CORRELA for three test sequences taken from the daily EMBL updates, but not present in release 27: HS3B5H5E (accession number X55997), AFDA (accession number M55626) and AEHOXA (accession number M64593). For each test sequence the first section of the table reports the names (preceded by the taxonomic class code) of the sequences extracted by FASTA, ranked in order of decreasing 'opt' score value, then the 'opt' score value itself and finally the correlation index r. An asterisk highlights r values higher than 0.5. The next sections of the table report the numbers of sequences not present in the FASTA list and having an r value higher than 0.7, or between 0.5 and 0.7, respectively. The last section reports, for each test sequence, the CPU times required for FASTA and CORRELA searches. The FASTA package used is the one installed on the EMBL VAXcluster, which includes a set of VAX6000 machines. CORRELA analysis was performed on a VAX3900 machine.

[Table I body: for each of the three test sequences, the FASTA hit list with 'opt' scores and r values, the numbers of further sequences from the CORRELA search with r > 0.7 and with 0.5 < r < 0.7, and the CPU times for the FASTA and CORRELA searches.]
