Measurement of word frequencies in genomic DNA sequences based on partial alignment and fuzzy set.

Journal of Bioinformatics and Computational Biology Vol. 12, No. 4 (2014) 1450019 (12 pages) # .c Imperial College Press DOI: 10.1142/S021972001450019X

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by NANYANG TECHNOLOGICAL UNIVERSITY on 03/03/15. For personal use only.

Measurement of word frequencies in genomic DNA sequences based on partial alignment and fuzzy set

Fumiya Shida* and Satoshi Mizuta† Graduate School of Science and Technology, Hirosaki University 3 Bunkyo-cho, Hirosaki, Aomori 036-8561, Japan *[email protected] † [email protected] Received 23 February 2014 Revised 28 June 2014 Accepted 7 July 2014 Published 1 August 2014 Accompanied with the rapid increase of the amount of data registered in the databases of biological sequences, the need for a fast method of sequence comparison applicable to sequences of large size is also increasing. In general, alignment is used for sequence comparison. However, the alignment may not be appropriate for comparison of sequences of large size such as whole genome sequences due to its large time complexity. In this article, we propose a semi alignmentfree method of sequence comparison based on word frequency distributions, in which we partially use the alignment to measure word frequencies along with the idea of fuzzy set theory. Experiments with ten bacterial genome sequences demonstrated that the fuzzy measurements has the e®ect that facilitates discrimination between close relatives and distant relatives. Keywords: Alignment-free; linguistic approach; rank correlation.

1. Introduction The estimation of similarity or dissimilarity between biological sequences is one of the central tasks in bioinformatics. In general, alignment global alignment1 or local one2 is used to do that for relatively short sequences such as nucleotide sequences of genes or amino acid sequences of proteins. However, the computational time complexity of pair-wise sequence alignment, which is O(L 2 ) for sequences of length L, is so large for long sequences that the alignment is not appropriate for comparing long sequences such as whole genome ones. Therefore, along with improvements in alignment-based methods,3–5 various alignment-free methods for sequence comparison have been broadly studied by many researchers.6,7 One of the major categories of the alignment-free methods for sequence comparison is linguistic approach.8–21 The basic strategy of the methodology is as follows: At ¯rst, a certain set of words W is constructed, in which a word is de¯ned as a short 1450019-1


F. Shida & S. Mizuta

sequence fragment of a ¯xed length; subsequently, the occurrence frequencies of the individual words in W are measured in target genomes; then the distances between the genomes are computed based on a distance measure between the frequency distributions. Although a \word" is referred to in various ways, K-tuple,10 k(n)-word,11,12,16,19 l-mer,14,15K-string,17 and so on, they are basically equivalent to each other. The size of word set W , denoted by jW j, is commonly set to the total number of words of length k (i.e. 4 k for DNA sequences), but a certain size jW j 4 k is also adopted in some studies.18 As for distance measures, Euclidean distance,21 Mahalanobis distance,19 rank correlations,18,22 symmetric Kullback–Leibler divergence,16 Jensen– Shannon divergence,14 and cosine angle metric10,13,17 have been widely used. Recently, some authors introduced the position information of the detected words into the similarity estimation.8,9 In this article, we propose a novel method to measure the frequency distributions of words in genome sequences. In our method, we use the alignment for comparison between a word and a subsequence in a genome and measure the occurrence frequencies of the words based on the idea of fuzzy set theory.23 2. Methods At ¯rst, we construct a set of words W , and then measure the occurrence frequencies of the individual words in W by alignment between each word in W and the sequence fragments in the target genome. Now, we describe our methods in detail below. 2.1. Construction of representative word set Let k be the word length, then the total number of possible words is 4 k for DNA sequences, for there are four types of nucleotides, A, T, G, and C. To construct word set W , we selected Nð 4 k Þ words out of 4 k randomly, which we call representative words (RWs), imposing that they were apart from each other by a prede¯ned distance to avoid deviations among the members of W . Now, we explain the procedure for constructing W . Let us consider, for example, the similarity between the following two words of length nine: AGTCGTGAC and GTCGTGACT. They are evaluated to be completely different by a similarity measure based on Hamming distance the distance is equal to 9, which is the full length of the words. However, given that they are well aligned as shown in Fig. 1, it seems to be appropriate that they are regarded analogous with each other. Therefore, we selected RWs so that the score of global alignment between

Fig. 1. Alignment of two words which have a large Humming distance but are well aligned. 1450019-2

Measurement of word frequencies in genomic DNA sequences


Fig. 2. Alignment of two words which can be selected for RWs simultaneously.

any pair of the words was equal to or lower than threshold ST ¼ ðk 3Þy þ 6z, where y and z are the match gain and the gap penalty, respectively. This threshold ST roughly corresponds to Hamming distance 3 under the condition 2z x, where x is the mismatch penalty. Note that Hamming distance 3 between a pair of RWs guarantees that any single word observed on a genome sequence is never counted as more than one RWs simultaneously under the condition that the two words having Hamming distance 1 are treated as equivalent. We used the alignment parameters y ¼ 1, z ¼ 1, and x ¼ 4 in selecting RWs. With these parameter values, we obtain ST ¼ 0 for k ¼ 9. For example, the alignment between two words AGTCGTGAC and ACTGGTAAC, which have Hamming distance 3, is given in Fig. 2. As the alignment score is computed to be 6y þ 6z ¼ 0, the both words can be selected for RWs simultaneously. On the other hand, because the alignment score of Fig. 1, which is calculated to be 6, is larger than threshold ST , at least one of the words in Fig. 1 is not selected as a RW. Now, we construct word set W according to the following algorithm. At ¯rst, we randomly select a word w from the list of all the words of length k, Tk ; we delete w from Tk and add w to W . Then, we repeat the following steps until jW j reaches N; we randomly select w from Tk ; we delete w from Tk and calculate the alignment scores with the above mentioned parameters between w and the individual words in W ; if the scores are all equal to or less than ST , we add w to W ; otherwise, we discard w. One of the author (SM) discussed the appropriate word length k in comparing genome sequences based on the linguistic approach24 and obtained a relationship between k and genome size NG , NG 4 k . If we set NG to the typical size of bacterial genomes 10 6 , then we obtain k 10. Therefore, we take into account word length k ranging from 8 to 12, allowing °exibility. Moreover, because we could obtain slightly more than 100 words when we selected RWs in the above mentioned manner with k ¼ 8, we selected N ¼ 100. In the previous study,24 the estimated distances between genome sequences showed little dependence on the number of RWs N; therefore, N ¼ 100 was applied to all the values of k. 2.2. Measurement of word frequencies We measured the occurrence frequencies of the individual RWs by comparing each RW with the fragments in genome sequences by global–local alignment global for the RWs and local for the fragments so as to take the in°uences of insertion and deletion events into account. We set the length of the fragment to 2k, adding a leading and a trailing margins to the word length k, and we proceeded the measurement by sliding a window of size 2k by k characters over the target genome sequences. 1450019-3



As for the alignment, we adopted an a±ne gap model; the a±ne gap penalty pa is calculated by pa ¼ po þ ðm 1Þpe , where po and pe are the gap opening and the gap extension penalties, respectively, and mð 1Þ is the gap length. Along the lines of database search program Gapped BLAST,25 in which the gap opening and the gap extension penalties are selected to be 11 and 1, respectively, we used po ¼ 2 and pe ¼ 0:2 (only their relative ratio has a meaning), and we used a match score 2 so that it could compensate for a gap of length 1. In deciding the value of mismatch penalty, we di®erentiated between a transition and a transversion. The transition is a mutation between purine bases (A and G) or pyrimidine ones (T and C), and the transversion is a mutation between a purine and a pyrimidine bases. Because the transition is known to happen more frequently than the transversion, we gave a severer penalty to the transversion than the transition; 3 for the transversion and 1 for the transition. We measured the occurrence frequencies of the individual RWs based on the idea of fuzzy set theory.23 The membership function, which provides the degree of belongingness of a fragment to the individual RWs, is de¯ned on the alignment score S as ðS SX Þ=10 þ 1 ðSX 10 S SX Þ fðSÞ ¼ ð1Þ 0 ðS < SX 10Þ; where SX is the maximum value of S. Note that SX is given for the alignment between a RW and a fragment including the RW on it.

2.3. Distance measure based on Kendall's rank correlation coe±cient Kendall's rank correlation coe±cient,22 or Kendall's , is a nonparametric measure of the degree of correlation between two sets of measurements. Although the number of occurrence of a certain word in a genome sequence depends on the genome size, the Kendall's is applicable to comparison between genome sequences of signi¯cantly di®erent sizes, as it refers to only the rank order of the measurements instead of their absolute values. Here, we brie°y describe the procedure for computing Kendall's . Let Ai and Aj be the numbers of occurrences of word wi and wj in genome A, respectively, and similarly let Bi and Bj be those of the corresponding words in genome B. Then we de¯ne the following four variables for the i; j ði < jÞ pairs; P is the number of concordant pairs in which ðAi Aj ÞðBi Bj Þ > 0 is satis¯ed; Q is the number of discordant pairs in which ðAi Aj ÞðBi Bj Þ < 0 is satis¯ed; r is the number of one kind of tie pairs in which Ai ¼ Aj and Bi 6¼ Bj are satis¯ed; and s is the number of the other kind of tie pairs in which Ai 6¼ Aj and Bi ¼ Bj are satis¯ed. If both Ai ¼ Aj and Bi ¼ Bj are satis¯ed, the corresponding i; j pairs are excluded from the computation. Now, Kendall's is calculated from these variables as P Q ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffipffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi : P þQþr P þQþs 1450019-4

ð2Þ


Kendall's lies between 1 and 1, and it takes 1 when the rank orders of the two measurements are completely in agreement with each other, and 1 when they are in complete reversal with each other. We re-scale the Kendall's as D¼1

þ1 ; 2

ð3Þ

so that distance D lies between 0 and 1.


3. Results 3.1. Genomes analyzed The genomes analyzed in this study are listed in Table 1. We selected two species from each of four phyla, proteobacteria, ¯rmicutes, other bacteria, and archaea, as well as two Escherichia coli species, K-12 and O-157, which we included to check the self-consistency of our method. Two species in proteobacteria and in ¯rmicutes are close relatives of each other, respectively, and two species in archaea are distant relatives of each other. Additionally, two species in other bacteria are in the middle position of each other between close and distant relatives. All the genome sequences were downloaded from GenBank.26 3.2. Select the word length To begin with, we investigated the dependence of Kendall's on word length k. Figure 3 shows the relationships between and k for three pairs of the genome sequences, E. coli K-12 and O-157 (same species), V. cholerae and Y. pestis (close relatives), and V. cholerae and P. horikoshii (distant relatives), in which the averages of 10 trials and their standard deviations as error bars are plotted. We can recognize in this ¯gure that is almost independent of word length k in this range. Therefore, considering the computational time the smaller k is, the shorter the computational time is and the abundance of RWs the larger k is, the more abundant RWs are we ¯x k ¼ 9 in the remaining of this paper. Table 1. List of genomes analyzed. Phyla

Species

Abbr. a

Length (bp)

Proteobacteria

Vibrio cholerae Yersinia pestis Bacillus halodurans Bacillus subtilis subsp. subtilis Deinococcus radiodurans Thermosynechococcus elongatus Archaeoglobus fulgidus Pyrococcus horikoshii

V.cho Y.pes B.hal B.sub D.rad T.elo A.ful P.hor

2961149 4653728 4202352 4215606 2648638 2593857 2178400 1738505

Escherichia coli K-12 Escherichia coli O-157

K12 O157

4639675 5498450

Firmicutes Other Bacteria Archaea Escherichia a Abbreviations

of species used in ¯gures and tables. 1450019-5

F. Shida & S. Mizuta 1 K12-O157

Kendall’s τ

0.8 0.6

V.cho-Y.pes 0.4 0.2 0 V.cho-P.hor -0.2 8

9

10

11

12


Word length k

Fig. 3. Dependence of Kendall's on word length k. The averages of 10 trials and their standard deviations as error bars are plotted (the error bars of E. coli pair are too small to be seen).

3.3. Distances between genomes and phylogeny Table 2 shows a distance matrix computed from Eq. (3) among all the species analyzed with a set of RWs shown in Table 3. The smallest distance 0.02 is obtained for E. coli pair, and the largest distance 0.82 is obtained for P. horikoshii and D. radiodurans pair. The fuzzy count distributions of the set of RWs in Table 3 for three pairs of species, E. coli K-12 and E. coli O-157, D. radiodurans and V. cholerae, and D. radiodurans and P. horikoshii, which show the minimum, the middle, and the maximum distances in Table 2, respectively, are shown in Fig. 4. As expected, the distributions between the same species, E. coli K-12 and E. coli O-157, represent almost identical appearance, though the absolute values of the occurrence frequencies are slightly di®erent from each other. On the other hand, the distributions between distant relatives, D. radiodurans and V. cholerae, and D. radiodurans and P. horikoshii, are quite di®erent. We reconstructed a phylogenetic tree by NEIGHBOR program in PHYLIP27 based on the neighbor-joining method28 from the distance matrix shown in Table 2. Figure 5 depicts the obtained tree, which is drawn by DRAWGRAM program in PHYLIP.27 The overall topology of the tree agrees with that of the phylogenetic tree given in Fig. 2 of Ref. 13, except that our tree has an internal branch realizing a

Table 2. Distance matrix among 10 species listed in Table 1.

Y.pes B.hal B.sub D.rad T.elo A.ful P.hor K12 O157

V.cho

Y.pes

B.hal

B.sub

D.rad

T.elo

A.ful

P.hor

K12

0.16 0.25 0.23 0.52 0.43 0.35 0.43 0.22 0.22

0.30 0.26 0.53 0.43 0.44 0.44 0.21 0.20

0.10 0.71 0.62 0.30 0.25 0.41 0.41

0.67 0.58 0.30 0.28 0.37 0.37

0.23 0.54 0.82 0.35 0.36

0.46 0.66 0.34 0.35

0.31 0.44 0.44

0.60 0.59

0.02

1450019-6


RW

%GC

RW

%GC

RW

%GC

RW

%GC

RW

%GC

CGGGCCAGA GGGGGGGGC GTTAGATTA GACCCACGC ACTACATCC ATGACTAAT AAATTGGAA CCGGCCCTA AGTATTCGG ACAAGGCAC ACATCTATG TCGGCTATC AGGTCCATA CCGATCAAT GACTTTTTA ATTGTATTT CTTTTTGGG GCTGCTGAC ATGCTGCCA AAAGCCAAT

77.8 100.0 22.2 77.8 44.4 22.2 22.2 77.8 44.4 55.6 33.3 55.6 44.4 44.4 22.2 11.1 44.4 66.7 55.6 33.3

TCAATTACG TCCTCTTTG CTGGAGGCG AGTTCTTGC GAACCCTAA CCAAAAGTC GCGAACCAG ACGGTTCCC CTAGCATAT TTGGCCGCT GTACTACTA AATACGATA GGCCGAGTC CAGGCCCGC TCTAGAGAC GAAGACAGT CCTCGCGTG AAAGGTTTG TTGTCGGGG CGCTTGTTC

33.3 44.4 77.8 44.4 44.4 44.4 66.7 66.7 33.3 66.7 33.3 22.2 77.8 88.9 44.4 44.4 77.8 33.3 66.7 55.6

ACTTAAGTT GGCTCCTTT GTCCGGAAG AGCTTCCGT GTCGCCCCC AGAATGTAG ATATTGCCT CCTTGTGGA GAGTTTAAA CTGTTCGGT CTGAAACCT GATGGGTCA AATAAAAAG CGTAAATTC ATTCGGGCC ATCTCGGAT CTGATTTCA ATGTCGTAC CCCAGATGC CTTATGCCC

22.2 55.6 66.7 55.6 88.9 33.3 33.3 55.6 22.2 55.6 44.4 55.6 11.1 33.3 66.7 44.4 33.3 44.4 66.7 55.6

CTCTTCCAT TAATCTCCC TTGGTAAAA GTTTTCCAG CATGAACGG ACTGGGAAC GATAAGAGC GTCATCAAC GAAAATCCA TAGATCAGA GTTCGTCGA CACGAATAC GGGGGCATA CTCAATGGG AAACCTGGC GTTAACAAT GACGACGAA ACCGGTCAG CTATATCTT GAGGACTTT

44.4 44.4 22.2 44.4 55.6 55.6 44.4 44.4 33.3 33.3 55.6 44.4 66.7 55.6 55.6 22.2 55.6 66.7 22.2 44.4

GCGGGACGG TAAATACAC CCCTTCGCC TTGTTTAAG GGTTAGCCA ACGTATGCA CTACTCAGG GGTTAACGG GTGCGGCGC CCCAAACAA CCACACGGA AGGCTTGGG TTGAATGGT AAGACCCTT TTCTGCGCG CCCCTTGAG TCGAAAGCA TAGAGGATA CGCCGACTT TACAGCGGC

88.9 22.2 77.8 22.2 55.6 44.4 55.6 55.6 88.9 44.4 66.7 66.7 33.3 44.4 66.7 66.7 44.4 33.3 66.7 66.7

Frequency (×103)

3

K12 O157

30 25 20 15 10 5 0 0

20

40 Word

60

80

100

40 35 30 25 20 15 10 5 0

D.rad V.cho

0

20

40

60

Word

3 Frequency (×10 )

35 Frequency (×10 )


Table 3. List of a set of RWs with their GC-content.

80

100

40 35 30 25 20 15 10 5 0

D.rad P.hor

0

20

40

60

80

100

Word

Fig. 4. Frequency distributions of a set of RWs in Table 3 for the three pairs of species having the minimum distance, K12-O157, the middle distance, D.rad-V.cho, and the maximum distance, D.rad-P. hor. In each graph, the RWs are lined up on the horizontal axis in ascending order of the frequencies measured for the former species of the individual pairs.

Fig. 5. Phylogenetic tree constructed from the distance matrix in Table 2 using the neighbor-joining method. Values at nodes represent the number of times that each branch appears in the 100 jackknife trees. 1450019-7

0.14 (0.008) 0.25 (0.025) a 0.23 (0.027) 0.54 (0.030) 0.40 (0.020) 0.35 (0.014) 0.42 (0.032) 0.21 (0.018) 0.21 (0.017)

0.29 0.26 0.55 0.42 0.43 0.43 0.18 0.18

(0.024) (0.025) (0.038) (0.020) (0.017) (0.036) (0.019) (0.017)

Y.pes

0.10 0.71 0.58 0.31 0.24 0.38 0.38

(0.008) (0.012) (0.025) (0.017) (0.017) (0.032) (0.031)

B.hal

0.68 0.55 0.32 0.28 0.35 0.34

(0.021) (0.019) (0.014) (0.024) (0.026) (0.026)

B.sub

0.25 0.53 0.79 0.40 0.41

(0.022) (0.027) (0.025) (0.040) (0.040)

D.rad

1450019-8

0.19 0.24 0.25 0.53 0.43 0.36 0.44 0.28 0.31

(0.021) (0.023) (0.018) (0.028) (0.022) (0.026) (0.017) (0.034) (0.033)

0.32 0.28 0.54 0.44 0.43 0.46 0.20 0.20

(0.026) (0.021) (0.033) (0.022) (0.032) (0.011) (0.035) (0.036)

Y.pes

0.14 0.67 0.56 0.30 0.28 0.37 0.36

(0.014) (0.030) (0.022) (0.025) (0.021) (0.031) (0.031)

B.hal

0.62 0.54 0.32 0.33 0.31 0.30

(0.026) (0.012) (0.026) (0.016) (0.031) (0.031)

B.sub

0.30 0.50 0.76 0.42 0.42

(0.029) (0.024) (0.009) (0.034) (0.034)

D.rad

Note: The averages of 10 trials and their standard deviations (in parentheses) are shown.


V.cho

0.43 0.63 0.35 0.35

(0.028) (0.034) (0.029) (0.029)

T.elo

0.30 (0.010) 0.43 (0.025) 0.43 (0.025)

A.ful

0.42 0.60 0.42 0.43

(0.020) (0.028) (0.023) (0.023)

T.elo

0.32 (0.025) 0.43 (0.019) 0.42 (0.018)

A.ful

Table 5. Distance matrix among 10 species listed in Table 1 without fuzzy measurements.

Note: The averages of 10 trials and their standard deviations (in parentheses) are shown. a Hatched cells have larger values than those in the corresponding cells in Table 5.


V.cho

Table 4. Distance matrix among 10 species listed in Table 1 with fuzzy measurements.


0.55 (0.014) 0.54 (0.013)

P.hor

0.56 (0.039) 0.56 (0.039)

P.hor

0.02 (0.003)

K12

0.02 (0.003)

K12



Distance with Fuzzy

1 0.8 0.6 0.4 0.2 0 0

0.2

0.4

0.6

0.8

1


Distance without Fuzzy

Fig. 6. Scatter plot between the distances calculated with and without fuzzy measurements.

bipartition fK12, O157, D.rad, T.elo j Y.pes, V.cho, B.hal, B.sub, A.ful, P.horg. Because E. coli, Y. pestis, and V. cholerae are all gamma proteobacteria, this bipartition is thought to be abnormal. However, small distances between species in those two groups, fK12, O157g and fY.pes, V.chog, given in Table 2, are not inconsistent with the fact that they are all close relatives. In reality, the aberrant bipartitioning branch disappears in a phylogenetic tree constructed from the same distance matrix by UPGMA method. To assess the reliability of the tree, we performed a jackknife test29–31 according to the following procedure: Randomly select 50 words out of 100 RWs; construct a distance matrix by our methods based on the selected 50 words; repeat the above steps 100 times and generate 100 jackknife trees by NEIGHBOR program in PHYLIP27; obtain a majority-rule consensus tree using CONSENSE program in PHYLIP27 with the number of times that each branch of the consensus tree appears in the jackknife trees (the values at the individual nodes in Fig. 5). 3.4. E®ects of fuzzy measurements In order to ascertain the e®ects of the fuzzy measurements, we compared the distance matrix calculated with the fuzzy measurements shown in Table 4 with that calculated without the fuzzy measurements shown in Table 5. In calculating the distances, we performed 10 trials with di®erent sets of RWs, and calculated the averages and the standard deviations of the results. In the scatter plot between the distances calculated with and without the fuzzy measurements (Fig. 6), we can see the tendency that the distances below 0.5 are smaller in the case of using fuzzy measurements than in the case of not using them, and the reverse is true for the distances above 0.5. That is, the fuzzy measurements make the distances smaller between closely related species and larger between distantly related species. 4. Discussions It is worth noting that the distances above 0.5 in Table 2 correspond to negative values of , which means that the corresponding frequency distributions of RWs have 1450019-9

40 35 30 25 20 15 10 5 0

D.rad P.hor


0

20

3

Frequency (×10 )

3

Frequency (×10 )


40 60 %GC

80

100

40 35 30 25 20 15 10 5 0

D.rad V.cho

0

20

40 60 %GC

80

100

Fig. 7. Scatter plots of the occurrence frequencies of a set of RWs versus their GC-contents measured for D. radiodurans and P. horikoshii, and for D. radiodurans and V. cholerae.

negative correlations. If the genome sequences of two species are completely unrelated, their frequency distributions of RWs would have no correlations, neither positive nor negative. This apparently strange appearance can be partly explained by a large di®erence in GC-content of their genome sequences. Figure 7 shows the scatter plots of the occurrence frequencies of the RWs in Table 3 versus their GC-contents for the species pair yielding the largest distance 0.82, D. radiodurans and P. horikoshii, and that yielding a middle distance 0.52, D. radiodurans and V. cholerae, for comparison. The GC-contents of D. radiodurans and P. horikoshii are 67.0% and 41.9%, respectively. Therefore, the RWs having large GC-contents tend to occur more frequently in D. radiodurans than in P. horikoshii, and vice versa, causing the negative correlation between the frequency distributions of RWs for the two species. On the other hand, no apparent correlation can be recognized in the plot for D. radiodurans and V. cholerae (the GC-content of V. cholerae is 47.7%). At the end of this section, we estimate the computational time complexity of our method. The alignment between a RW of length k and a fragment cut by a sliding window of width 2k on the genome sequence needs the computational time of Oðk 2 Þ; the number of the sliding windows with step k over the genome sequence of size L is OðL=kÞ; there are N RWs to compare with each fragment cut by a sliding window. Thus, the computational time to measure occurrence frequencies for each genome becomes Oðk 2 Þ OðL=kÞ N ¼ OðNkLÞ. Then computing the distance between the occurrence frequencies by Kendall's needs the computational time of OðN 2 Þ. As a result, the total computational time is given by OðNkL þ N 2 Þ, which can be reduced to OðLÞ if k and N are set to small constants compared to L. Actually, this is thought to be a practical condition we took k ¼ 9 and N ¼ 100 in this study. Note that, if we take k log4 L (see Sec. 2.1), the time complexity is modi¯ed to OðL log LÞ. However, the above consideration is valid even for that case, because log4 L L is satis¯ed for large L. 1450019-10


5. Conclusion


We proposed a novel method of sequence comparison based on word frequency distributions, in which we partially use the alignment to measure word frequencies along with the idea of fuzzy set theory. The time complexity of our method is OðLÞ for sequence length L under the realistic conditions, which is practical for comparing very long sequences. Applying our method to ten bacterial genome sequences, we obtained the result that the measurement of word frequencies with the idea of fuzzy set theory boosted the closeness and remoteness between genome sequences. This e®ect facilitates discrimination between close relatives and distant relatives.

References 1. Needleman SB, Wunsch CD, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol 48:443–453, 1970. 2. Smith TF, Waterman MS, Identi¯cation of common molecular subsequences, J Mole Biol 147:195–197, 1981. 3. Nakato R, Gotoh O, Cgaln: Fast and space-e±cient whole-genome alignment, BMC Bioinform 11:224, January 2010. 4. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL, Versatile and open software for comparing large genomes, Genome Biol 5(2):R12, January 2004. 5. Chain P, Kurtz S, Ohlebusch E, Slezak T, An applications-focused review of comparative genomics tools: Capabilities, limitations and future challenges, Brie¯ngs Bioinform 4(2):105–123, June 2003. 6. Vinga S, Almeida J, Alignment-free sequence comparison A review, Bioinformatics 19(4):513–523, 2003. 7. Mantaci S, Restivo A, Sciortino M, Distance measures for biological sequences: Some recent approaches, Int J Approximate Reason 47(1):109–124, 2008. 8. Ding S, Li Y, Yang X, Wang T, A simple k-word interval method for phylogenetic analysis of DNA sequences, J Theoret Biol 317:192–199, January 2013. 9. Huang Y, Wang T, Phylogenetic analysis of DNA sequences with a novel characteristic vector, J Math Chem 49(8):1479–1492, March 2011. 10. Aita T, Husimi Y, Nishigaki K, A mathematical consideration of the word-composition vector method in comparison of biological sequences, Bio Syst 106(2–3):67–75, November 2011. 11. Dai Q, Liu X, Yao Y, Zhao F, Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison, J Theoret Biol 276(1):174– 180, 2011. 12. Ding S, Dai Q, Liu H, Wang T, A simple feature representation vector for phylogenetic analysis of DNA sequences, J Theoret Biol 265(4):618–623, June 2010. 13. Yu Z-G, Zhan X-W, Han G-S, Wang RW, Anh V, Chu KH, Proper distance metrics for phylogenetic analysis using complete genomes without sequence alignment, Int J Mole Sci 11(3):1141–1154, January 2010. 14. Sims GE, Jun S-R, Wu GA, Kim S-H, Alignment-free genome comparison with feature frequency pro¯les (FFP) and optimal resolutions, Proc Nat Acad Sci 106(8):2677–2682, 2009. 15. Csűr€os M, Noe L, Kucherov G, Reconsidering the signi¯cance of genomic word frequencies, Trends Genet 23(11):543–546, 2007. 1450019-11



16. Wu T-J, Huang Y-H, Li L-A, Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences, Bioinformatics 21(22):4125–4132, November 2005. 17. Qi J, Wang B, Hao B-I, Whole proteome prokaryote phylogeny without sequence alignment: A K-string composition approach, J Mole Evol 58(1):1–11, January 2004. 18. Kirzhner VM, Korol AB, Bolshoy A, Nevo E, Compositional spectrum revealing patterns for genomic sequence characterization and comparison, Physica A 312:447–457, 2002. 19. Wu TJ, Burke JP, Davison DB, A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words, Biometrics 53(4):1431–1439, 1997. 20. Karlin S, Ladunga I, Comparisons of eukaryotic genomic sequences, Proc Nat Acad Sci USA 91(26):12832–12836, December 1994. 21. Blaisdell BE, A measure of the similarity of sets of sequences not requiring sequence alignment, Proc Nat Acad Sci USA 83(14):5155–5159, July 1986. 22. Kendall M, Gibbons JD, Rank Correlation Methods, 5th edn., Oxford University Press, USA, 1990. 23. Zadeh LA, Fuzzy sets, Information and Control 8(3):338–353, 1965. 24. Mizuta S, Hanya K, Speci¯cations of word set in linguistic approach for similarity estimation, Proc ISCA 2nd Int Conf Bioinformatics and Computational Biology, pp. 25–29, 2010. 25. Altschul SF, Madden TL, Schä®er AA, Zhang J, Zhang Z, Miller W, Lipman DJ, Gapped BLAST and PSI-BLAST: A new generation of protein database search programs, Nucleic Acids Res 25(17):3389–3402, September 1997. 26. GenBank. Available at http://www.ncbi.nlm.nih.gov/ (accessed 2 December 2013). 27. Felsenstein J, Phylip (phylogeny inference package) version 3.65, 2005. 28. Saitou N, Nei M, The neighbor-joining method: A new method for reconstructing phylogenetic trees, Mole Biol Evol 4(4):406–425, 1987. 29. Shi J, Zhang Y, Luo H, Tang J, Using jackknife to assess the quality of gene order phylogenies, BMC Bioinform 11:168, January 2010. 30. Luo H, Shi J, Arndt W, Tang J, Friedman R, Gene order phylogeny of the genus Prochlorococcus, PloS One 3(12):e3837, January 2008. 31. Luo H, Sun Z, Arndt W, Shi J, Friedman R, Tang J, Gene order phylogeny and the evolution of methanogens, PloS One 4(6):e6069, January 2009.

Fumiya Shida received M.Eng. in Electronics and Information Technology from Hirosaki University in 2014. His research interest was alignment-free sequence comparison. He currently works for a Japanese company, Fujitsu Broad Solution & Consulting Inc.

Satoshi Mizuta studied physics at Tohoku University, Japan, and received Ph.D. in Physics from the university in 1993. He is currently an associate professor of Electronics and Information Technology in the Graduate School of Science and Technology at Hirosaki University, Japan. His research interests are genome informatics and the evolution of genetic codes. 1450019-12

The exceptional genomic word symmetry along DNA sequences.

Fast alignment-free sequence comparison using spaced-word frequencies.

Compilation and alignment of DNA polymerase sequences.

Fast alignment of DNA and protein sequences.

Alignment of Escherichia coli K12 DNA sequences to a genomic restriction map.

Genomic signal processing methods for computation of alignment-free distances from DNA sequences.

Estimating evolutionary distances between genomic sequences from spaced-word matches.

Fast comparison of genomic and meta-genomic reads with alignment-free measures based on quality values.

Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches.

Multiple microwave frequencies measurement based on stimulated Brillouin scattering with improved measurement range.

Implementation of Steiner point of fuzzy set.

Multicriteria Decision Making Method Based on the Higher Order Hesitant Fuzzy Soft Set.

Comparison of period-3 correlation amplitudes in genomic DNA sequences.

Some new sets of sequences of fuzzy numbers with respect to the partial metric.

Genomic feature extraction and comparison based on global alignment of ChIP-sequencing data.

Identifying potential tRNA genes in genomic DNA sequences.

Using fuzzy-set qualitative comparative analysis.

An integrated risk assessment model of township-scaled land subsidence based on an evidential reasoning algorithm and fuzzy set theory.

Simultaneous alignment and folding of protein sequences.

Word Meaning Frequencies Affect Negative Compatibility Effects In Masked Priming.

A set of BASIC programs to evaluate relationships among protein sequences by optimum alignment and distance matrix analysis.

Zipf's Law for Word Frequencies: Word Forms versus Lemmas in Long Texts.

Global genomic diversity of human papillomavirus 6 based on 724 isolates and 190 complete genome sequences.

Chabertia erschowi (Nematoda) is a distinct species based on nuclear ribosomal DNA sequences and mitochondrial DNA sequences.