Nucleic Acids Research, Vol. 20, No. 1 131-136
A simple method for global sequence comparison
Elisabetta Pizzi*, Marcella Attimonelli, Sabino Liuni, Clara Frontali¹ and Cecilia Saccone
Centro Studi Mitocondri e Metabolismo Energetico CNR, Dipartimento di Biochimica e Biologia Molecolare, University of Bari, 70126 Bari and ¹Laboratorio di Biologia Cellulare, Istituto Superiore di Sanità, 00161 Rome, Italy
Received April 29, 1991; Revised and Accepted December 4, 1991
ABSTRACT
A simple method of sequence comparison, based on a correlation analysis of oligonucleotide frequency distributions, is here shown to be a reliable test of overall sequence similarity. The method does not involve sequence alignment procedures and permits the rapid screening of large amounts of sequence data. It identifies those sequences which deserve more careful analysis of sequence similarity at the level of resolution of the single nucleotide. It uses observed quantities only and does not involve the adoption of any theoretical model.
INTRODUCTION
The continually increasing amount of sequence data generated by recent genome sequencing projects necessitates high-speed methods for the characterization and analysis of new sequences. When a new sequence is generated, a comparative analysis with a nucleic acid (or protein) sequence database is generally performed as a first step in the attempt to characterize its function. Several methods are available which perform this analysis by detecting similarity through best-alignment procedures. As is now generally accepted (1-3), when dealing with sequence comparisons, the term similarity is not equivalent to homology, a qualitative concept which includes a common evolutionary origin.
Several widely used computer programs for similarity search (4-8) are based on variations of the scoring system introduced by Smith and Waterman (9). In order to detect biologically important similarities, these methods emphasize the relevance of those segments of the compared sequences which maximize the similarity score. They usually require large computers in order to perform alignment-based screening of large databases in a reasonable time. We investigated the possibility of using a global approach to efficiently pre-screen a large database, in order to rapidly identify those sequences which are related to a given one. We aimed at a user-friendly program, applicable to any type of sequence, that would extract from databases subsets of sequences deserving detailed analysis by more precise and meaningful alignment procedures.
Global approaches which adopt 'macroscopic' parameters and reflect average sequence properties have been proposed by Blaisdell (10) and by Pietrokovsky, Hirshon and Trifonov (11). Both methods analyse the differences between observed and predicted global sequence properties, the predictions being based on a Markov chain model. These methods have the advantage of reducing the relevance of effects related to differential composition in mono-, di-, or other short oligonucleotides. It is, however, rather doubtful whether a Markovian model of defined order can represent any type of sequence.
In order to eliminate the dependence on an arbitrary model, we exploited a suggestion present but not developed in the work of Brendel, Beckmann and Trifonov (12). The idea is to perform a comparative analysis using a test of overall similarity in oligonucleotide composition. This is most quickly done by calculating a correlation coefficient between frequency distributions of nucleotide strings of a given length (oligomers). Through a series of suitable tests it is shown here that this quick and simple macroscopic approach closely reflects the results of an alignment-based similarity search, and that the signal/noise ratio can be efficiently improved by an appropriate choice of oligomer length. The correlation method (CORRELA) yields reliable results in the rapid location of a subsequence related to a given fragment, and in the identification of clusters of phylogenetically related genes. The relation of the global parameter to overall base composition and 3rd codon position choices is discussed. A comparison with current methods (FASTA, 13) appears to confirm the efficacy of the CORRELA program.
Description of the algorithm
Let X and Y be the two variables whose correlation has to be tested over a range of values $[X] = X_1, X_2, \ldots, X_k, \ldots, X_N$ and $[Y] = Y_1, Y_2, \ldots, Y_k, \ldots, Y_N$. The correlation index is defined as

$$ r = \frac{\sum_k (X_k - X_m)(Y_k - Y_m)}{\left[\sum_k (X_k - X_m)^2 \, \sum_k (Y_k - Y_m)^2\right]^{1/2}} \qquad (1) $$

$X_m$ and $Y_m$ being the average values of the two variables over the given range of values.

* To whom correspondence should be addressed at Laboratory of Cell Biology, Istituto Superiore di Sanità, Rome, Italy
In our case, $X_k$ and $Y_k$ are the occurrence numbers for the kth oligomer in sequences x and y respectively. If n is the length of the oligomer (expressed in bp), the number of different oligomers is $N = 4^n$ and

$$ X_m = \frac{\sum_k X_k}{N} = \frac{L_x - n + 1}{4^n}, \qquad Y_m = \frac{\sum_k Y_k}{N} = \frac{L_y - n + 1}{4^n} $$
where $L_x$ and $L_y$ are the lengths of sequences x and y (expressed in bp) and $L - n + 1$ is the total number of oligomers of length n which can be accommodated in a sequence of length L, and is equal to the sum of the observed occurrences. r is a dimensionless parameter, ranging from -1 to +1. However, we are interested essentially in the positive range (0 < r < 1), a negative correlation being an unrealistic case. Note that the definition of r given by relation (1) would not change if frequencies were used in place of occurrences.
At this point the following observations can be made:
1) If two sequences have a high similarity, their correlation coefficient is necessarily near to 1, except in the very unlikely case in which all the $X_k$ are near to $X_m$, and all the $Y_k$ are near to $Y_m$, i.e. both sequences have a flat oligonucleotide distribution.
2) A high correlation coefficient can also be found in the case in which the two sequences share little sequence similarity but have similar frequency distributions of oligomers. The probability of such an event obviously decreases when the n value is increased. However, the higher n is, the longer the computational time. In practice, it is convenient to use a value of n ranging from 3 to 5.
3) By definition, r does not depend on the length of the compared sequences. Nevertheless, care must be exercised when comparing sequences of different length, as detailed later.
We shall assume in the course of this work that a significant correlation corresponds to r > 0.5. The validity of this assumption is substantiated by the tests described in the following sections. In conclusion, r([X],[Y]) is a global index which expresses a correlation in the deviations from the average oligomer frequencies in two sequences. It obviously does not coincide with similarity indices based on the local comparison of aligned sequences, but rather reflects a similarity in the oligomer frequency distribution.
We will therefore use the terms 'overall similarity' and 'microscopic similarity' to distinguish between the two concepts. The correlation coefficient defined in relation (1) can be used to give a numerical indication of overall similarity, approaching the unity value when the two sequences have a high degree of microscopic similarity. Compared with an index expressing microscopic sequence similarity, it can give false positives (see point 2), but the probability of false negatives is low (see point 1).
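The definition in relation (1) can be illustrated with a minimal Python sketch. This is not the CORRELA program itself: the function names (`oligomer_counts`, `correlation`, `correla_r`) are hypothetical, the alphabet is assumed to be A/C/G/T only, and windows containing any other symbol are simply skipped, which is an assumption of this sketch rather than something stated in the text.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: unambiguous nucleotides only


def oligomer_counts(seq, n):
    """Occurrence numbers of all 4**n oligomers of length n (overlapping windows)."""
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=n))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - n + 1):
        j = index.get(seq[i:i + n])
        if j is not None:  # sketch-level choice: skip windows with symbols such as N
            counts[j] += 1
    return counts


def correlation(x, y):
    """Correlation index r of relation (1) for two equally long count vectors."""
    xm = sum(x) / len(x)
    ym = sum(y) / len(y)
    num = sum((a - xm) * (b - ym) for a, b in zip(x, y))
    den = sqrt(sum((a - xm) ** 2 for a in x) * sum((b - ym) ** 2 for b in y))
    if den == 0:
        # flat oligomer distribution: r is undefined (see observation 1)
        raise ValueError("flat oligomer distribution")
    return num / den


def correla_r(seq_x, seq_y, n=4):
    """Overall similarity of two sequences as the correlation of their oligomer counts."""
    return correlation(oligomer_counts(seq_x, n), oligomer_counts(seq_y, n))
```

For a sequence of length L without ambiguous symbols, `sum(oligomer_counts(seq, n))` equals L - n + 1, in agreement with the expression for the averages given above, and replacing counts by frequencies leaves `correlation` unchanged because r is invariant under rescaling of either variable.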
MATERIALS AND METHODS
The correlation analyses were performed using the EMBL data library (updated daily) resident at the Italian EMBnet node in the ACNUC (14) format. The CORRELA program was developed on a DEC VAX3900 computer under the VMS operating system. The tetramer distribution vocabulary of all sequences contained in the EMBL data library (release 27) has been produced. The CORRELA program is available along with the tetramer vocabulary and the program for updating it. Similarity searches using the FASTA package were performed through the EMBL fileserver.
RESULTS
Internal correlation tests
As a first test of the usefulness and applicability of the correlation method, we calculated the correlation coefficient between a sequence of length L2 and a set of its subsequences. From the tandem reiteration of the oligomer ACGT, a 1000 bp sequence was generated and compared (choosing n=4) with subsequences of variable length L1 (100, 200, 300, ..., 900 bp). This artificial example was chosen so as to start from the simple case in which the oligomer frequency distribution is identical in the compared sequences. The results (plotted in Fig. 1 as a function of L1/L2) give a horizontal line which graphically confirms statement 3, i.e. the fact that r does not depend on the length of the compared sequences, as long as the frequency distribution does not change. However, when the same procedure is used with real sequences (the examples in Fig. 1 refer to two Adenovirus type 2 sequences) the result is different, and depends on the degree of internal correlation of the longer sequence. In effect the shape of the oligomer frequency distribution varies along the mother sequence, and therefore between the latter and its subsequences. The two real examples show that in order to perform a rigorous comparison using the correlation test on two sequences of unequal length, one should proceed with a window analysis, scanning the longer sequence with a window whose width corresponds to the length L1 of the shorter sequence. An indicative comparison might be performed directly only if L1 and L2 are of the same order of magnitude (e.g. L1/L2 > 0.4 in the case illustrated in Fig. 1, which refers to the choice n=4).
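The artificial test with the ACGT tandem repeat can be reproduced with a short Python sketch. The helper names (`kmer_counts`, `pearson_r`) are hypothetical and the code is only an illustration of relation (1) under the assumption of an A/C/G/T-only alphabet; it is not the authors' implementation.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, n):
    """Occurrence numbers of overlapping oligomers of length n."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def pearson_r(cx, cy, n):
    """Relation (1) evaluated over all 4**n oligomers (A/C/G/T alphabet assumed)."""
    N = 4 ** n
    xm = sum(cx.values()) / N
    ym = sum(cy.values()) / N
    keys = set(cx) | set(cy)
    absent = N - len(keys)  # oligomers with zero occurrences in both sequences
    sxy = sum((cx[k] - xm) * (cy[k] - ym) for k in keys) + absent * xm * ym
    sxx = sum((cx[k] - xm) ** 2 for k in keys) + absent * xm ** 2
    syy = sum((cy[k] - ym) ** 2 for k in keys) + absent * ym ** 2
    return sxy / sqrt(sxx * syy)


# Artificial mother sequence: 250 tandem copies of ACGT (L2 = 1000 bp)
mother = "ACGT" * 250
mother_counts = kmer_counts(mother, 4)

# r between the mother sequence and its first L1 nucleotides, L1 = 100 ... 900
curve = [
    (L1 / len(mother), pearson_r(kmer_counts(mother[:L1], 4), mother_counts, 4))
    for L1 in range(100, 1000, 100)
]
```

Each entry of `curve` is a pair (L1/L2, r). Because every subsequence has essentially the same tetramer distribution as the mother sequence, the r values stay very close to 1 for all L1, which corresponds to the horizontal line described for the artificial case.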
Fig. 1. Results of correlation analysis between a sequence of length L2 and a set of its subsequences of length L1. The correlation index r, calculated for n=4 (see text), is plotted as a function of L1/L2 for: (x) an artificial mother sequence (L = 1000 bp), constructed by the tandem reiteration of the oligomer ACGT; (+) the initial and (@) the terminal 3000 bp of the linear genome of Adenovirus type 2. In each case the subsequences chosen for the comparison corresponded to the first L1 nucleotides of the mother sequence.

Overall versus microscopic similarity
How does the overall similarity, expressed by the correlation coefficient of oligonucleotide distribution, compare with the microscopic similarity, expressed by the percentage of perfect matches in the aligned sequences? To answer this question both indices have been calculated in a number of pairwise comparisons of test sequences corresponding to a defined mitochondrial DNA fragment (896 bp in length) from different primate species. The results are given in Fig. 2. For each pair of compared sequences we calculated both h (number of matches/total length; best alignment procedures are not necessary in this case) and r (correlation index using n=4). It can be seen that the two parameters follow similar variations and are also numerically similar.

Correlation analysis on the sequences of a gene functionally conserved across species
As a direct test of the ability of the correlation method to reveal sequence homology, we analysed a set of genes from widely different species, all coding for the same enzyme protein, glutamine synthetase (GS). This choice was based on the observation (15) that this gene is one of the few which behave as a perfect 'molecular clock', the base substitution rate in second codon position being apparently constant, so that the calculated divergences reflect estimated evolutionary time periods. The GS genes we compared are those listed in Fig. 3, which presents in the form of a square matrix the results of the comparisons performed on all possible pairs. The upper diagonal half of the matrix contains the values of the correlation coefficient r, and the lower half the microscopic similarity index h, estimated according to Wilbur and Lipman (4). Since the lengths of the examined sequences are of the same order of magnitude, a direct calculation of the correlation index was performed (using n=4), without resorting to window analysis. Only values greater than or equal to 0.6 are indicated for r, while for h only values ≥0.7 are reported. The clusters formed by the mammal and plant GS genes are thus clearly recognizable in the matrix, along with a pair of highly related sequences, i.e. the GS genes from Escherichia coli and Salmonella typhimurium, while, e.g., the Anabaena GS gene appears to be scarcely related to any of the others.
From this analysis we conclude that the correlation analysis identifies the correct relationships between phylogenetically related sequences. The method is less suited for the detection of local similarities between conserved portions of phylogenetically distant sequences, although window correlation analysis might be utilised (see next section).

Location of a subsequence related to a given sequence
In order to test whether correlation analysis lends itself to locating in a sequence those regions which are related to a given subsequence, we have chosen to work on a small, complete genome, that of coliphage lambda. We first tested the ability of the CORRELA program to find the exact position of a lambda subsequence (LAMBET, 3400 bp in length). The complete lambda genome was scanned using window correlation analysis, by moving a window of width L1=3400 bp in steps of L1/4 and by calculating for each step the correlation coefficient with three different choices of n (n=3,4,5). The results are reported in Fig. 4a. For all three choices of n, a well defined peak, near to unity at its maximum value, is found at the expected position (indicated by a bar in Fig. 4a). The signal/noise ratio progressively improves, as expected, when the n value is increased. Secondary peaks, corresponding to regions whose oligonucleotide frequency distribution is similar to that of the tested subsequence, are in effect progressively reduced, while the peak corresponding to the exact location is unaffected by the increase in n. Since the oligonucleotide frequency distribution in a given sequence also obviously depends on its base composition, we plotted on the same graph the G+C content of each window (dotted line in Fig. 4a). Comparison of the superimposed plots clearly shows that the G+C content alone is not the main factor determining good correlation.
We then searched for the immunity region in the lambda genome, by performing the same kind of analysis against the immunity region LAIMM434 (not recognised by the lambda repressor) of the related phage lambda-imm434. This sequence (1287 bp) is correctly located (Fig. 4b) by a correlation peak which, as expected, does not reach unity.
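The window correlation analysis described here (window width equal to the probe length L1, shifted in steps of L1/4) can be sketched in Python as follows. The function names, the toy random "genome" and the convention of returning 0 for a flat distribution are assumptions made for this illustration; the lambda sequences themselves are not reproduced.

```python
import random
from collections import Counter
from math import sqrt


def kmer_counts(seq, n):
    """Occurrence numbers of overlapping oligomers of length n."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def pearson_r(cx, cy, n):
    """Relation (1) over all 4**n oligomers (A/C/G/T alphabet assumed)."""
    N = 4 ** n
    xm, ym = sum(cx.values()) / N, sum(cy.values()) / N
    keys = set(cx) | set(cy)
    absent = N - len(keys)  # oligomers absent from both sequences
    sxy = sum((cx[k] - xm) * (cy[k] - ym) for k in keys) + absent * xm * ym
    sxx = sum((cx[k] - xm) ** 2 for k in keys) + absent * xm ** 2
    syy = sum((cy[k] - ym) ** 2 for k in keys) + absent * ym ** 2
    den = sqrt(sxx * syy)
    return sxy / den if den > 0 else 0.0  # sketch-level convention for flat distributions


def window_scan(genome, probe, n=4, step=None):
    """Return (window start, r) pairs for windows of width len(probe) along genome."""
    width = len(probe)
    step = step or max(1, width // 4)  # default shift of L1/4, as in the text
    probe_counts = kmer_counts(probe, n)
    return [
        (start, pearson_r(kmer_counts(genome[start:start + width], n), probe_counts, n))
        for start in range(0, len(genome) - width + 1, step)
    ]


# Toy example: a random 2000 bp sequence and a 400 bp probe cut out of it
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(2000))
probe = genome[800:1200]
profile = window_scan(genome, probe, n=3)
best_start, best_r = max(profile, key=lambda p: p[1])
```

In this toy case the probe start (800) is a multiple of the step (100), so one window coincides exactly with the probe and gives r equal to 1 up to rounding; a probe that falls between window positions, or a related but not identical probe such as LAIMM434, would give a peak below unity.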
Fig. 2. Comparison between the correlation index r (calculated for n=4) and the similarity index h (percentage of perfect matches) for all possible pairs formed by the following sequences (accession numbers from EMBL release 23 are indicated first, followed by entry name, species and symbol used in the figure): V00658, MIGG45, Gorilla, GG; V00659, MIHL45, Gibbon, HL; V00672, MIPP45, Chimpanzee, PP; V00675, MIPY45, Orang-utan, PY. These sequences, 896 bp in length, correspond to the Hind III restriction fragment of the mitochondrial DNA which contains the genes for transfer RNAs specific for histidine, serine and leucine, and part of the reading frames for subunits 4 and 5 of NADH dehydrogenase.
Fig. 3. The nuclear glutamine synthetase (GS) genes from various species and, in the case of plant GS genes, differentially expressed in nodules (nd), roots (rt) and leaves (cp), were paired in all possible ways. For each pair, the correlation index r (n=4) and the similarity index h (calculated according to the Wilbur and Lipman algorithm with the following parameters: k-tuple size=3; gap penalty=3; largest gap=20) were reported respectively in the upper right and lower left halves of the matrix. Only r values higher than 0.6 and h values higher than 0.7 are reported. The length (in bp) of each sequence is indicated in brackets. Accession numbers, from EMBL release 23, for deposited sequences are: 1-M29579, 2-X03495, 3-Y00387, 5-X03931, 6-X05515, 7-X04001, 8-X04763, 9-X04002, 10-X05514, 11-X12738, 12-X00147, 13-M13746, 14-M14536, 15-M16626, 16-X04187. Sequence 4 was taken from ref. 22.
Sequences compared: Rattus norvegicus, Chinese hamster, Homo sapiens, Drosophila melanogaster, Medicago sativa, Pisum sativum (nd, rt, cp), Phaseolus vulgaris (nd, rt, cp), Anabaena 7120, Escherichia coli, Salmonella typhimurium, Thiobacillus ferrooxidans, Bradyrhizobium japonicum.
[Fig. 3 matrix: the individual r and h values and the per-sequence lengths are not recoverable from the extracted figure.]
Fig. 5. Correlation analysis on a set of codogenic sequences belonging to the genome of Bos taurus. The compared sequences, indicated by their name in EMBL release 23, are numbered in order of decreasing G+C content (indicated after the sequence name). The upper diagonal half of the matrix gives the r values calculated with n=4 for each pair of genes. In the lower half of the matrix, the values of the ratio, t, between average G+C content in third codon position in the two compared sequences are reported. r values lower than 0.5 and t values deviating from unity by more than 10% have been suppressed, as has the principal diagonal axis. For simplicity in presentation, t values larger than unity have been inverted.
[Fig. 5 matrix: the sequence names and the individual r and t values are not recoverable from the extracted figure.]
Fig. 4. Search for regions related to a given subsequence along the lambda genome. The 48502 bp of the complete lambda genome were scanned by window correlation analysis a) with the subsequence LAMBET (EMBL release 23, accession number V00638) using various oligomer lengths n (n=3,4,5); b) with the immunity region LAIMM434 (EMBL release 23, accession number J02460) of phage lambda-imm434, using n=4. In both cases a window whose width corresponded to the length, L1, of the tested subsequence (L1=3400 bp and L1=1287 bp respectively) was shifted in steps of L1/4. The expected position along the genome is indicated by a horizontal bar. The percent G+C content of each window along the lambda genome is also reported in Fig. 4a (dotted line).
Correlation index and base composition
A deeper investigation of the relationship between correlation index and base composition was performed on a set of coding sequences belonging to the same genome, that of Bos taurus. For each pair of sequences, two indices were calculated and are reported as the two diagonal halves of the same matrix in Fig. 5. The first index, reported in the upper, right-hand part of the matrix, is the correlation coefficient r (calculated for n=4). Only values greater than or equal to 0.5 are marked. The second index, t, is defined as the ratio between the G+C content in the third codon positions of the two compared genes. Only values deviating from unity by 10% or less are reported. Genes are ordered in the matrix according to their overall G+C content.
Several observations stem from the examination of this matrix. Firstly, some clustering appears. This is particularly evident in the mitochondrial genes and, less clearly, in the G+C rich genes (% G+C in the range 55-65%). Secondly, the fact that the two halves of the matrix are roughly symmetrical with respect to the diagonal axis (left blank in the figure for the sake of clarity) suggests that in the case of coding sequences the correlation coefficient reflects, to a certain extent, the similarity in third position choice. However, a close inspection reveals that there are several cases in which t values are high and yet no significant correlation is found. Conversely, there are cases in which a significant correlation coefficient is found between pairs of genes which do not follow the rule that 3rd position choices reflect the overall G+C content (examples of both cases can be observed by comparing column 1 and row 1). Therefore the correlation analysis yields additional information. A particularly interesting case is that of the mitochondrial gene ND6 (no. 36 in the matrix), which is the only gene located on the heavy strand of the mitochondrial genome. The asymmetry in G/C distribution between the two strands obviously does not affect the t value, but does strongly reduce the overall similarity with other mitochondrial sequences.
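The index t can be illustrated with a small Python sketch. The function names (`gc3`, `t_index`, `is_reported`) are hypothetical, and the sketch assumes that each coding sequence is supplied in frame, starting at the first position of a codon; the inversion of ratios larger than unity and the 10% cut-off follow the description of Fig. 5.

```python
def gc3(cds):
    """Fraction of G or C at third codon positions (sequence assumed in frame)."""
    third = cds.upper()[2::3]
    if not third:
        raise ValueError("sequence shorter than one codon")
    return sum(base in "GC" for base in third) / len(third)


def t_index(cds_a, cds_b):
    """Ratio of third-position G+C contents, inverted when larger than unity."""
    low, high = sorted((gc3(cds_a), gc3(cds_b)))
    if high == 0:
        raise ValueError("no G or C at third codon positions in either sequence")
    return low / high


def is_reported(t, tolerance=0.10):
    """True when t deviates from unity by 10% or less."""
    return t >= 1.0 - tolerance


# Toy coding sequences (hypothetical, for illustration only)
a = "ATGGCCAAA"     # third positions G, C, A    -> G+C fraction 2/3
b = "ATGGCGTTTAAC"  # third positions G, G, T, C -> G+C fraction 3/4
t = t_index(a, b)   # (2/3) / (3/4) = 8/9
```

With these toy sequences t is 8/9, i.e. it deviates from unity by slightly more than 10%, so under the stated cut-off this pair would not be marked in the lower half of the matrix.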
Comparison of CORRELA to FASTA in database screening
In order to test the efficacy of the CORRELA program in screening a large database, a sample of recently sequenced genes (not yet present in EMBL release 27 of 31.03.1991) was chosen as test sequences and compared against the whole of EMBL release 27, using both the CORRELA and FASTA programs for similarity searches. The three chosen sequences are: the protein kinase regulator from Homo sapiens (entry name: HS3B5H5E, accession number: X55997, length: 1528 bp); the ribosomal RNA small subunit from Aspergillus fumigatus (AFDA, M55626, 1798 bp); the hydrogenase transcriptional activator HOXA gene from Alcaligenes eutrophus (AEHOXA, M64593, 1811 bp). The results are given in Table I. The 90 sequences extracted by FASTA as those giving the best similarity score to each of the test sequences were listed in order of decreasing 'opt' score value.
Table I. Results of similarity searches on the EMBL database (release 27) performed by FASTA and CORRELA for three test sequences taken from the daily EMBL updates, but not present in release 27: HS3B5H5E (accession number X55997), AFDA (accession number M55626) and AEHOXA (accession number M64593). For each test sequence the first section of the table reports the names (preceded by the taxonomic class code) of the sequences extracted by FASTA, ranked in order of decreasing 'opt' score value, then the 'opt' score value itself and finally the correlation index r. An asterisk highlights r values higher than 0.5. The next sections of the table report the numbers of sequences not present in the FASTA list and having an r value higher than 0.7, or between 0.5 and 0.7, respectively. The last section reports, for each test sequence, the CPU times required for the FASTA and CORRELA searches. The FASTA package used is the one installed on the EMBL VAXcluster, which includes a set of VAX6000 machines. CORRELA analysis was performed on a VAX3900 machine.
[Table I body: for each test sequence, the FASTA-ranked entries with 'opt' score and r, the further sequences from the CORRELA search with r > 0.7 and with 0.5 < r < 0.7, and the CPU times. The individual entries are not recoverable from the extracted table.]