Research Article

JOURNAL OF COMPUTATIONAL BIOLOGY Volume 22, Number 7, 2015 # Mary Ann Liebert, Inc. Pp. 1–11 DOI: 10.1089/cmb.2014.0200

A Dynamic Programming Algorithm For (1,2)-Exemplar Breakpoint Distance ZHEXUE WEI,1 DAMING ZHU,1 and LUSHENG WANG 2

ABSTRACT The exemplar breakpoint distance problem is motivated by finding conserved sets of genes between two genomes. It asks to find respective exemplars in two genomes to minimize the breakpoint distance between them. If one genome has no repeated gene (called trivial genome) and the other has genes repeating at most twice, it is referred to as the (1, 2)-exemplar breakpoint distance problem, EBD(1, 2) for short. Little has been done on algorithm design for this problem by now. In this article, we propose a parameter to describe the maximum physical span between two copies of a gene in a genome, and based on it, design a fixedparameter algorithm for EBD(1, 2). Using a dynamic programming approach, our algorithm can take O(4sn2) time and O(4sn) space to solve an EBD(1, 2) instance that has two genomes of n genes where the second genome has each two copies of a gene spanning at most s copies of the genes. Our algorithm can also be used to compute the maximum adjacencies between two genomes. The algorithm has been implemented in C11. Simulations on randomly generated data have verified the effectiveness of our algorithm. The software package is available from the authors. Key words: algorithm, breakpoint, exemplar, genome, span.

1. INTRODUCTION

F

inding conserved sets of genes (or sequences) between two genomes is one of the fundamental problems in comparison of genomes. A set of genes that is conserved in the same order in genomes suggests that they participate in the same biological process. In order to find conserved sets of genes in genomes, many similarity measures have been proposed, for example, the number of breakpoints, the number of adjacencies, the number of conserved intervals, and the number of common intervals (Angibaud et al., 2008a,b). All these measures are well defined when genomes do not contain gene duplicates. However, gene duplication is one of the primary driving forces in the evolution of genomes and occurs frequently (Dennis et al., 2012). When duplicates occur, a gene may appear in several places of a genome. The so-called exemplar breakpoint distance just arises from comparing genomes where we assume that duplications are more recent than the speciation.

1

School of Computer Science and Technology, Shandong University, Jinan, China. Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong.

2

1

2

WEI ET AL.

Following the notations of Blin et al. (2007, 2009), a genome is a sequence of genes over a set of gene families, where a gene is an occurrence of a gene family. An exemplar of a genome is a subsequence of that genome, in which each gene family occurs exactly once. The exemplar breakpoint distance problem, EBD for short, is defined as follows: given two genomes, the task is to find two exemplars, one for each of the two genomes, such that the breakpoint distance between the two exemplars is minimized. A special case where we want to find two identical exemplars for the two given genomes is referred to as the zero exemplar breakpoint distance problem, ZEBD for short. If each gene family occurs in those two genomes for at most p and q times respectively, the exemplar breakpoint distance (resp. zero exemplar breakpoint distance) problem is emphasized as the ( p, q)-exemplar breakpoint distance (resp. zero breakpoint distance) problem, EBD( p, q) [resp. ZEBD( p, q)] for short. Sankoff (1999) gave a branch and bound algorithm for EBD. Nguyen et al. (2005) then proposed a divide and conquer method for this problem and developed a program that combines both the divide and conquer and the branch and bound technique. The effectiveness of this approach was verified by using bacterial genomes as the input data. Bryant (2000) first showed that EBD(1, 2) is NP-hard. Chen et al. (2006) then showed that ZEBD(3, 3) is NP-Hard. Later in 2010, Blin et al. (2009) showed that ZEBD(2, 2) is also NP-Hard. This implies that it is NP-hard to approximate EBD( p, q) within any performance ratio in polynomial time if p ‡ 2, q ‡ 2. There are more hardness results in terms of approximation. Angibaud et al. (2008a) first showed that EBD(1, 2) is APX-hard. Recently, Bulteau and Jiang (2012) showed that the (1, 2)-exemplar distance problem is hard to approximate for a wide variety of distance measures. To be specific, they showed that it is NP-hard (1) to approximate (1, 2)-exemplar MAD distance within ratio-2; (2) to approximate (1, 2)-exemplar SAD dispffiffiffi tance within ratio-(10 5 - 21); and (3) to approximate (1, 2)-exemplar signed reversal distance within ratio-1237 1236. Moreover, they showed that (1, 2)-exemplar Levenshtein distance and (1, 2)-exemplar Hamming distance are both APX-hard. As for the fixed-parameter computational intractability, Zhu (2009) showed that EBD(2, 2) rejects any FPT-algorithm that uses the output distance as one of the parameters. Several exact algorithms have been proposed in recent years. Blin et al. (2009) designed an O(n2n) time algorithm for ZEBD by a color-coding technique (Alon et al., 1995), where n is the number of gene families that occur in the genomes. They also gave a parameterized algorithm for ZEBD with O(m22ss3) time, where m is the size of the shortest input genome and s the maximum number of genes that are spanned by two genes of the same gene family in a genome. Fu and Zhang (2011) developed an O(2mnO(1)) time algorithm for EBD(1, q), where m is the number of the gene families and n the number of genes in the genome with gene duplicates. Jiang (2011) showed that ZEBD(1, q) can be solved in polynomial time. Zhu and Wang (2013) proposed an algorithm for ZEBD(2,2) with running time O(n21.86121n), where n is the number of gene families in the genomes. Other approaches for EBD (resp. ZEBD) can be found in (Angibaud et al., 2008b; Blin et al., 2007; Bonizzoni et al., 2007; Chen et al., 2006). Although EBD(1, 2), as the most fundamental exemplar breakpoint distance problem, has been known to be APX-Hard for several years, little is done in terms of algorithm design. In fact, whether EBD(1, 2) admits a fixed-parameter algorithm remains open until now. In this article, we design a fixed-parameter algorithm for EBD(1, 2) using span of the second genome, say s, as a parameter, where the span of a genome is introduced to represent the maximum number of genes that are spanned by two genes of the same gene family in that genome. This depends on a practical observation that two genes of the same gene family are usually separated by a small number of genes in a genome (Shi et al., 2010). Our algorithm runs in O(4sn2) time and O(4sn) space for an EBD(1, 2) instance with n gene families in each of its genomes. We have implemented the algorithm in C + + . Simulations on randomly generated data show that our algorithm can always handle a genome of 5000 gene families within 1000 seconds when the span of the genome is 10, and can handle a genome of 1000 gene families within 1600 seconds when the span of the genome is 15.

2. PRELIMINARIES Let S be a set of symbols, each representing a gene family. A genome on S is a string over S, where each gene in the string is an occurrence of a gene family. A gene family can occur in a string (or genome) for more than once. We do not distinguish a genome with a string if there is no special need. Two genes are identical if they are both the occurrences of the same gene family. Two strings are identical, if their genes

ALGORITHM FOR (1,2)-EXEMPLAR BREAKPOINT DISTANCE

3

from left to right are identical successively. Let X be a string on S. For a gene family g ˛ S, we denote by occ(X, g) the number of occurrences of g in X. Let occ(X) = max{ occ(X, g) j g ˛ S }. A gene family g is trivial in X if occ(X, g) = 1, and nontrivial if occ(X, g) > 1. A gene of a trivial (nontrivial) family in a string is also trivial (nontrivial). A string is trivial if all its genes are trivial and nontrivial otherwise. The breakpoint distance has been introduced as a similarity measure of two trivial genomes. If two genomes or strings, say G1 and G2, are both trivial, then two genes in G1 or G2 can be adjacent for at most once. Thus two adjacent genes form an adjacency in G1 with respect to G2, if they are adjacent in G2. Otherwise, they form a breakpoint in G1 with respect to G2. For two arbitrary trivial genomes G1 and G2, there must be as many breakpoints in G1 with respect to G2 as those in G2 with respect to G1. Thus the number of breakpoints in G1 with respect to G2 is referred to as the breakpoint distance between G1 and G2, and denoted as d(G1, G2). To measure the similarity of generic genomes, Sankoff (1999) introduced the exemplar model, which keeps only one gene for each gene family and goes back to the breakpoint distance for two trivial genomes. This gives rise to the exemplar breakpoint distance problem. An exemplar of a genome (resp. string) is a subsequence of the genome (resp. string) in which each gene family occurs exactly once. An exemplar of a genome must be trivial. The exemplar breakpoint distance problem, EBD for short, is given by two genomes, say G1 and G2 over S, and asks to find the exemplars P1 of G1 and P2 of G2, such that the breakpoint distance between P1 and P2 is minimized. As stated in section 1, EBD turns into EBD( p, q), if occ(G1) = p, occ(G2) = q. Moreover, EBD( p, q) is reduced to ZEBD( p, q), if it asks to find two identical exemplars of G1 and G2 respectively. The breakpoint distance between an exemplar of G1 and an exemplar of G2 is referred to as the exemplar breakpoint distance between G1 and G2, if it is minimized over all pairs of respective exemplars of G1 and G2. In what follows, we denote by ebd(G1, G2) the exemplar breakpoint distance between G1 and G2. Since EBD(1, 2) is NP-Hard, but ZEBD(1, 2) can be solved in polynomial time ( Jiang, 2011; Zhu and Wang, 2013), EBD(1, 2) can be regarded as the most fundamental one in those computational intractable exemplar breakpoint distance problems (Bryant, 2000). Formally, EBD(1, 2) can be stated as, Instance: Two genomes G1 and G2 over S, where occ(G1) = 1, occ(G2) = 2. Objective: Find an exemplar P2 of G2 such that the breakpoint distance between G1 and P2 is minimized, that is, d(G1, P2) = ebd(G1, G2). Let G = G[1] . . . G[m] be a genome on S. For convenience, we use G[i, j] to represent the consecutive subsequence G[i] . . . G[j] of G‚ ipj. For j ‡ i, the span of G[i] and G[j] in G is j - i. The span of a genome (or string) is the maximum span over all pairs of identical genes in that genome. For example, the span of G = 1, 4, 3, 2, 4, 1, 5, 2, 6 is 5, because G[1] = G[6] = 1, and the span of any other two identical genes in G is less than 5. In practice, there are many genomes with small spans (Shi et al., 2010). In the next section, we show that EBD(1, 2), although is NP-hard, can be solved in polynomial time if the span of G2 is restricted to be a constant.

3. THE ALGORITHM FOR EBD(1,2) In this section, G1 stands for a trivial genome on S, while G2 a genome on S with occ(G2) = 2 and s ‡ 2 as its span. For a string X on S, the length of X is the number of genes in X, and denoted as jXj. To find an exemplar of G2 that has as few as possible breakpoints with respect to G1, it is necessary to compute the breakpoint distances between G1 and the exemplars of G2. To get the breakpoint distance between G1 and an exemplar of G2, we have to pay attention to how many breakpoints a substring of that exemplar has. The following notations are used to formalize the computation for finding that exemplar we cried for. Let X be an arbitrary string on S. We denote by E(X) the exemplar set of X. For e ˛ E(X), we denote by b(e, G1) the number of breakpoints in e with respect to G1. An exemplar, say e of X is minimum with respect to G1, if b(e‚ G1 ) = minf b(e0 ‚ G1 ) j e0 2 E(X) g.

3.1. Dynamic programming to find a minimum exemplar Let G2 = G2 [1]‚ . . . ‚ G2 [m]. Since a gene family in S can occur in G2 for at most twice, we refer to G2[i] for 1 £ i £ m as the first occurrence of its gene family in G2, if G2[1, i - 1] does not contain any identical gene to G2[i]; otherwise, we refer to G2[i] for 1 £ i £ m as the second occurrence of its gene family in G2.

4

WEI ET AL.

In all exemplars of G2[1, i], one must be able to turn into a minimum exemplar of G2 with respect to G1, by appending an exemplar of G2[i + 1, m] to the right side of it. Fortunately, if the span of G2 is s, then only by maintaining a group of exemplars for G2[1, i] which, in number, is bounded by a function of s, can we get a minimum exemplar of G2 with respect to G1. For a string X on S and a gene x, we represent by X $ x the string formed by appending x to the right side of X. Lemma 1. Let e be an exemplar of G2[1, i - 1]. If G2[i] is the first occurrence of its gene family in G2, then e $ G2[i] is an exemplar of G2[1, i]. Proof. Since G2[i] is the first occurrence of its gene family in G2, e does not contain any gene identical to G2[i]. Thus e $ G2[i] must be an exemplar of G2[1, i]. Notice that any exemplar of G2[1, i] must contain an identical gene to G2[i]. Thus Lemma 1 implies that if G2[i] is the first occurrence of its gene family in G2, the exemplar set of G2[1, i] is just the set of strings e$G2[i], where e is an exemplar of G2[1, i - 1]. For a string X and a gene x, we denote by X yx the subsequence of X with the exclusion of those genes identical to x. Lemma 2. Let e be an exemplar of G2[1, i - 1]. If G2[i] is the second occurrence of its gene family in G2, then both e and (eyG2[i])$G2[i] are exemplars of G2[1, i]. Proof. Since G2[i] is the second occurrence of its gene family in G2, e must contain an identical gene to G2[i]. Thus e must be an exemplar of G2[1, i]. Consequently, (eyG2[i])$G2[i] is also an exemplar of G2[1, i]. Note again that any exemplar of G2[1, i] must contain an identical gene to G2[i]. Thus Lemma 2 implies that if G2[i] is the second occurrence of its gene family in G2, then the exemplar set of G2[1, i] is exactly the set of strings e as well as (eyG2[i])$G2[i] where e is an exemplar of G2[1,i - 1]. Let X = X[1] . . . X[k] be an arbitrary string of length k on S, we refer to X[k - t, k] as the t-terminal of X if k ‡ t; we refer to X itself as the t-terminal of X, if k < t. A t-terminal of a string is, in fact, a suffix of length t + 1 of that primary string. If two or more exemplars of G2[1, i] have the same s-terminal, we argue that at most one of them can grow up into a minimum exemplar of G2 with respect to G1. Lemma 3. Let e1 and e2 be two exemplars of G2[1, i - 1] with at least s + 1 genes, while G2[i] the first occurrence in G2. If e1 and e2 have the same s-terminal, and b(e1, G1) £ b(e2, G1), then e1$G2[i] and e2$G2[i] have the same s-terminal, and b(e1$G2[i], G1) £ b(e2$G2[i], G1). Proof. Let e1 = e1 [1]‚ . . . ‚ e1 [k], e2 = e2 [1]‚ . . . ‚ e2 [k]. By Lemma 1, since G2[i] is the first occurrence in G2, e1$G2[i] and e2$G2[i] are both exemplars of G2[1, i]. If G1 has a substring of length 2 that is identical to e1[k]G2[i], then e1[k]G2[i] is an adjacency in e1$G2[i] with respect to G1. Thus b(e1$G2[i], G1) = b(e1, G1). Now that e1[k - s, k] = e2[k - s, k], b(e2$G2[i], G1) = b(e2, G1) holds as well. Thus b(e1$G2[i], G1) £ b(e2$G2[i], G1) follows from b(e1, G1) £ b(e2, G1). If G1 has no substring of length 2, which is identical to e1[k]G2[i], then b(e1$G2[i], G1) = b(e1, G1) + 1, and b(e2$G2[i], G1) = b(e2, G1) + 1. Thus b(e1$G2[i], G1) £ b(e2$G2[i], G1) follows from b(e1, G1) £ b(e2, G1). Lemma 4. Let e1, e2 be two exemplars of G2[1, i - 1] with at least s + 1 genes, while G2[i] the second occurrence in G2. If e1 and e2 have the same s-terminal, and b(e1, G1) £ b(e2, G1), then (e1yG2[i])$G2[i] and (e2yG2[i])$G2[i] have the same s-terminal, and b((e1yG2[i])$G2[i], G1) £ b((e2yG2[i])$G2[i], G1). Proof. Let e1 = e1 [1]‚ . . . ‚ e1 [k]‚ e2 = e2 [1]‚ . . . ‚ e2 [k]‚ e01 = (e1 nG2 [i])  G2 [i]‚ e02 = (e2 nG2 [i])  G2 [i]. Since G2[i] is the second occurrence in G2 and the span of G2 is s, an identical gene to G2[i] must be in e1[k - s, k] and e2[k - s, k] as well. Since e1[k - s, k] = e2[k - s, k], thus (e1[k - s, k]yG2[i])$G2[i] = (e2[k - s, k]yG2[i])$G2[i]. That is, e10 [k - s, k] = e20 [k - s, k]. Since the span of G2 is s, G2[i]se1[k - s], G2[i]se2[k - s]. Thus, b(e10 , G1) = b(e1, G1) - b(e1[k - s, k], G1) + b(e10 [k - s, k],G1), b(e20 , G1) = b(e2, G1) - b(e2[k - s, k], G1) + b(e20 [k - s, k], G1). Consequently, b(e10 , G1) £ b(e20 , G1) follows from b(e1, G1) £ b(e2, G1). -

ALGORITHM FOR (1,2)-EXEMPLAR BREAKPOINT DISTANCE

5

Let E be a set of trivial strings on S each of which has at least s + 1 genes. Then the s-terminal set of E is formulated as T(E, s) = {e[jej - s,jej] je ˛ E}. A subset of E is referred to as an s-core of E, if it has the same s-terminal set as that E has, and its cardinality is just jT(E, s)j. An s-core of E is minimum, if with respect to G1, each string in it has no more breakpoints than those any string in E with an s-terminal identical to that of the string has. More formally, let M = {e ˛ Ej b(e, G1) £ b(e0 , G1), e0 [je0 j - s,je0 j] = e[jej - s,jej], e0 ˛ E }, if jMj = jT(E, s)j, then M is a minimum s-core of E. A minimum s-core of E(G2[1, i]) always contains a string that can grow into a minimum exemplar of G2 with respect to G1, because, Lemma 5. to G1.

A minimum s-core of E(G2[1, i]) must contain a minimum exemplar of G2[1, i] with respect

Proof. Let K be a minimum s-core of E(G2[1, i]). Let e be an exemplar in E(G2[1, i]) whose breakpoints with respect to G1 are minimized in number over all exemplars in E(G2[1, i]). Since K is an s-core of E(G2[1, i]), there is an exemplar, say e0 ˛ K, which has the identical s-terminal that e has. Since K is minimum, b(e0 , G1) £ b(e, G1). Thus e0 must be a minimum exemplar of G2[1, i] with respect to G1. Let E be a set of trivial strings of length k on S. Then a minimum s-core of E can be found from E in O(k$jEj) time. In what follows, we denote by MK(E) the subroutine that returns a minimum s-core of E. By Lemma 5, we can arrive at a minimum s-core of E(G2) by recursively computing the minimum s-core of E(G2[1, i]), 1 £ i £ m. To formalize how to get a minimum s-core of E(G2[1, i]) from the minimum s-core of E(G2[1, i – 1]), we extend the usage of the operators ’$’ and ’y’ onto a set of trivial strings, say E, and a gene, say x. That is, (1) E $ x h { e$x j e ˛ E }; (2) Eyx h { eyx j e ˛ E }. Theorem 1. For 1 £ i £ m, let KE[i - 1] be a minimum s-core of E(G2[1, i - 1]). Then a minimum s-core of E(G2[1, i]) can be computed by, 8 fG2 [i]g‚ i = 1: > > < (1) KE[i] = MK(KE[i - 1]  G2 [i])‚ iq2‚ G2 [i] is the first occurrence in G2 : > > : MK(KE[i - 1] [ ((KE[i - 1]nG2 [i])  G2 [i]))‚ otherwise:

Proof.

If i = 1, the proof is trivial. Later, let i ‡ 2.

-

(1) If G2[i] is the first occurrence in G2, then every exemplar of G2[1, i] must be in E(G2[1, i - 1])$G2[i] by Lemma 1. Since KE[i - 1] is an s-core of E(G2[1, i - 1]), each string in E(G2[1, i - 1])$G2[i] must correspond to a string in KE[i - 1]$G2[i], which has the identical s-terminal that it has. That is, MK(KE[i - 1]$ G2[i]) must be an s-core of E(G2[1, i]). For an arbitrary exemplar, say e ˛ E(G2[1, i - 1]), let e0 be that exemplar in KE[i - 1] that has the identical s-terminal that e has. Since KE[i - 1] is the minimum core of E(G2[1, i - 1]), b(e0 , G1) £ b(e, G1). By Lemma 3, e0 $G2[i] has an s-terminal identical to that of e$G2[i] and b(e0 $G2[i], G1) £ b(e$G2[i], G1). Namely, MK(KE[i - 1]$G2[i]) is a minimum s-core of E(G2[1, i]). (2) If G2[i] is the second occurrence in G2, then by Lemma 2, every exemplar of G2[1, i] must be in E(G2[1, i - 1]) or (E(G2[1, i - 1])yG2[i])$G2[i]. Since KE[i - 1] is an s-core of E(G2[1, i - 1]), a string in E(G2[1, i]) must correspond to one in KE[i - 1] or (KE[i - 1]yG2[i])$G2[i], which has the identical s-terminal that it has. Let e be an arbitrary exemplar in E(G2[1, i - 1]). Then KE[i - 1] has a string, say e0 , which has the identical s-terminal that e has. Since KE[i - 1] is a minimum s-core of E(G2[1, i - 1]), b(e0 ,G1) £ b(e,G1). By Lemma 2, e0 and (e0 yG2[i])$G2[i] are both exemplars of G2[1, i]. By Lemma 4, (e0 yG2[i])$G2[i] and (eyG2[i])$G2[i] have the same s-terminal, and b((e0 yG2[i])$G2[i],G1) £ b((eyG2[i])$G2[i],G1). To sum up, MK(KE[i - 1]W((KE[i - 1]yG2[i])$G2[i])) must return a minimum s-core of E(G2[1, i]).

6

WEI ET AL.

FIG. 1. The dynamic programming for finding a minimum exemplar of G2 with respect to G1.

By Theorem 1, the minimum exemplar in E(G2[1, m]) or E(G2) with respect to G1 can be found by the dynamic programming in Figure 1, which is named as Exemplar(G1, G2). In this algorithm, min(KE) stands for a subroutine to find the string in KE whose breakpoints with respect to G1 are minimized in number over all strings in KE.

3.2. The complexity of the algorithm Let KE[i] be the set of exemplars made by the ith for circulation of Exemplar (G1, G2). By Theorem 1, KE[i] must be a minimum s-core of E(G2[1, i]) with respect to G1. The complexity of this algorithm depends on the cardinality of KE[i]. Thus in this subsection, we evaluate on how many strings KE[i] has. By Lemma 5 and Theorem 1, if two or more exemplars in E(G2[1, i]) have the same s-terminal, then at most one of them can stay in KE[i]. Thus to evaluate how many strings KE[i] has, it suffices to evaluate how many strings the s-terminal set of KE[i] has. If the length of a string in KE[i] is at most s, then at most s gene families can occur in G2[1, i]. Since each gene occurs in G2 at most twice, KE[i] has at most 2s strings. If the length of a string in KE[i] is at least s + 1, we show that KE[i] has at most 4s strings. To hit this point, we bound the number of genes that can become members in the s-terminals of those exemplars of G2[1, i]. Note that two genes are different, even if they are identical in a genome. A gene, say g, takes part in a substring of an exemplar of G2[1, i], if the substring of that exemplar happens to contain g as a gene. Lemma 6. Let u be an integer with 1 £ u < i. If just t gene families occur in G2[u + 1, i] rather than G2[1, u], then any gene in G2[1, u] cannot take part in the (t - 1)-terminal of an exemplar of G2[1, i]. Proof. Let e be an arbitrary exemplar of G2[1, i]. If in G2[1, i], a gene is on the left of another, then the gene is on the left of that another in e, provided that they both take part in e. A gene family must occur in e, if it occurs in G2[1, i]. Thus if a gene, say g in G2[1, u], takes part in e, then each gene family that occurs in G2[u + 1, i] rather than G2[1, u] must occur in e on the right side of g. Since t gene families occur in G2[u + 1, i] rather than G2[1, u], at least t genes are on the right side of g in e. Thus, g cannot take part in the (t - 1)-terminal of e. We refer to an integer u with 1 £ u < i as the s-bound of G2[1, i], if in number, those gene families that occur in G2[u + 1, i] rather than G2[1, u] are totaled s, and those gene families that occur in G2[u, i] rather than G2[1, u - 1] are totaled s + 1. By Lemma 6, if u is an s-bound of G2[1, i], then G2[u] cannot take part in the (s - 1)-terminal of any exemplar of G2[1, i]; and moreover, a gene in G2[1, u - 1] cannot take part in the s-terminal of any exemplar of G2[1, i]. For example, let G2 = G2[1, 13] = 7, 6, 7, 5, 6, 1, 2, 5, 3, 1, 2, 3, 4. Then s = 4. The gene families 1, 2, 3, 4 occur in G2[5, 13] rather than G2[1, 4]. Moreover, the gene families 1, 2, 3, 4, 5 occur in G2[4, 13] rather than G2[1, 3]. Thus the s-bound of G2[1, 13] is 4. If no more than s gene families occur in G2[1, i], then any exemplar of G2[1, i] has no more than s genes. Otherwise, the s-bound of G2[1, i] can be identified by the following. Lemma 7. Let at least s + 1 gene families occur in G2[1, i]. Then, an integer u is the s-bound of G2[1, i], if and only if u = min { vj just s gene families occur in G2[v + 1, i] rather than G2[1, v], 1 £ v < i }.

ALGORITHM FOR (1,2)-EXEMPLAR BREAKPOINT DISTANCE

7

Proof. Let u = min { vj just s gene families occur in G2[v + 1, i] rather than G2[1, v], 1 £ v < i }. There must exist just s + 1 gene families that occur in G2[u, i] rather than G2[1, u - 1], because if otherwise, u is not minimized. Thus u is the s-bound of G2[1, i]. On the other hand, if u is the s-bound of G2[1, i], then u £ v, if just s gene families occur in G2[v + 1, i] rather than G2[1, v]. Otherwise, at most s gene families will occur in G2[u, i], a contradiction to that u is the s-bound of G2[1, i]. By Lemma 7, the s-bound of G2[1, i], if present, is unique. To evaluate how many strings KE[i] has, we bound the number of gene families that can occur on the right side of the s-bound in G2[1, i]. Lemma 8.

Let u be the s-bound of G2[1, i], then at most 2s gene families can occur in G2[u, i].

Proof. By Lemma 7, no gene in G2[1, u - 1] is identical to G2[u]. Otherwise, u is not the s-bound of G2[1, i]. Let C be the set of gene families that occur in G2[u + 1, i] and also G2[1, u]. Then jCj £ s - 1, because the span of G2 is s. Moreover, by the definition of the s-bound, just s + 1 gene families occur in G2[u, i] rather than G2[1, u - 1]. Thus at most 2s gene families can occur in G2[u, i]. Theorem 2.

A minimum s-core of E(G2[1, i]) has at most 4s strings.

Proof. Let u be the s-bound of G2[1, i]. Then by Lemma 6, only those genes in G2[u, i] can take part in the s-terminal of an exemplar of G2[1, i]. By Lemma 8, at most 2s gene families can occur in G2[u, i]. Thus it can set at most 22s = 4s trivial strings by the genes in G2[u, i]. That is, the s-terminal set of E(G2[1, i]) has at most 4s strings, and so has the s-core of E(G2[1, i]). Actually, we have not come across any instance of EBD(1, 2) for which more than 2 $ 3s exemplars have to be saved in running of Exemplar(G1, G2). For example, let x be a positive integer. If an EBD(1, 2) instance is given by G1 = G1[1, 5x], G2 = G2[1, 10x], where, G1 [1‚ 5x] = 1‚ 2‚ . . . ‚ 5x - 1‚ 5x; G2 [1‚ 3x] = 5x‚ 5x - 1‚ . . . ‚ 2x + 1; G2 [3x + 1‚ 6x] = G2 [1‚ 3x]; G2 [6x + 1‚ 8x] = 2x‚ 2x - 1‚ . . . ‚ 1; G2 [8x + 1‚ 10x] = G2 [6x + 1‚ 8x]‚

(2) (3) (4) (5) (6)

s then the span of G2 is s = 3x. The minimum s-core of E(G2[1, 10x]) has s 2s + 3 3 exemplars. Let’s concentrate on the running time to compute the minimum s-core of E(G2[1, i]) from that of E(G2[1, i - 1]). Note that KE[i] is a minimum s-core of E(G2[1, i]). For convenience, we denote by e = e[1]‚ . . . ‚ e[k] an arbitrary string in KE[i - 1]. (1) If G2[i] is the first occurrence in G2, then by Theorem 2, KE[i - 1]$G2[i] has at most 4s strings. It takes O(n) time to decide whether e[k]G2[i] is an adjacency or a breakpoint with respect to G1. That is, it can take O(4sn) time to find those sets of breakpoints of all strings in KE[i - 1]$G2[i], provided that the sets of breakpoints of all strings in KE[i - 1] have been already set. A minimum s-core of KE[i - 1]$G2[i] can be found in O(4sn) time, if we use a hash table to store those strings in KE[i - 1]$G2[i]. A string of length s + 1 is treated as a key of the hash table, if it is identical to the s-terminal of a string in KE[i - 1]$G2[i]. Setting the hash table for those strings in KE[i - 1]$G2[i] takes O(4s) time. (2) If G2[i] is the second occurrence in G2, (KE[i - 1] yG2[i])$G2[i] has at most 4s strings by Theorem 2. Moreover, KE[i - 1]W((KE[i - 1]yG 2[i])$G2[i]) has at most 2$4s strings. Let e0 = (eyG 2[i])$G2[i]. Then as mentioned before, it takes O(n) time to compute b(e0 [k - s, k], G1) as well as b(e[k - s, k], G1). That is, it can take O(4sn) time to find those sets of breakpoints of all strings in (KE[i - 1]yG2[i])$G2[i]. Following the arguments of (1), a minimum s-core of KE[i - 1]W((KE[i - 1]yG2[i])$G2[i]) can be found in O(4sn) time. Since the algorithm has to compute KE[i] for 1 £ i £ m £ 2n, the time complexity of the algorithm must be O(4sn2). Thus, Theorem 3 follows.

Theorem 3. The algorithm Exemplar(G1, G2) can always return a minimum s-core of E(G2) in O(4sn2) time and O(4sn) space.

8

WEI ET AL.

FIG. 2. The running time for computing the exemplar breakpoint distance of G1 and G2 with n gene families, where the nontrivial gene families of G2 are bounded by 90%, 80%, and 70% of all, respectively, and the span of G2 is 10.

4. SIMULATION We have implemented the algorithm Exemplar(G1, G2) in C + + . As noted in section 2, if a gene family occurs in a genome only once, then it is trivial in that genome, otherwise, nontrivial. Before evaluating the running time of Exemplar(G1, G2) by real data, we focus on how to generate two genomes G1 and G2 as an EBD(1, 2) instance. Given the set S of n gene families, we can get G1 = G1[1, n] by generating a random permutation, in which each gene family in S occurs once. Let s be given as the span of G2. Let in G2, p be given as the percentage of those nontrivial gene families from all in S. Then G2 can be generated by the following computations: (1) Let m = n + np, G2 = G2[1, m], G2[x] = NULL, 1 £ x £ m. (2) Insert genes into G2 for those nontrivial gene families: select a gene family, say g in S; randomly, select two integers, say i, j, which subject to jj - ij £ s and G2[i] = G2[j] = NULL; G2[i] ) G2[j] ) g. This insertion of two genes of g transforms G2 into one that is still denoted as G2. (3) Repeat (2) until in number, the nontrivial gene families in G2 achieve a given percent of all gene families in S. (4) Insert genes into G2 for those trivial families by randomly selecting a gene family, say g ˛ S, and an integer, say i with G2[i] = NULL and set G2[i] by a gene of g. For convenience, we name this subroutine to randomly generate a trivial genome G1 and a nontrivial genome G2 with span s and p percent of nontrivial from all gene families in S as RandData(S, s, p), where S, s, and p are given as the input parameters of it. Using RandData(S, s, p) to generate genomes G1, G2 as the EBD(1, 2) instance constantly, we evaluate the real running time of Exemplar(G1, G2) in three aspects. Firstly, fixing the span of G2, we observe the running time increase rates of Exemplar (G1, G2) along with the size increases of the genomes. For the span of G2 fixed to be 10, the nontrivial gene families in G2

FIG. 3. The running time for computing the exemplar breakpoint distance of G1 and G2 with 1000 gene families, where G2 has span from 6 to 15. The three fold lines are used to represent the running time increase rates for G2 with 1900, 1800, and 1700 genes respectively.

ALGORITHM FOR (1,2)-EXEMPLAR BREAKPOINT DISTANCE

9

FIG. 4. The running time for computing the exemplar breakpoint distance of G1 and G2 with 1000 gene families, where G2 has span from 21 to 75. For G2 with 1300, 1200, 1100 genes respectively, the running times for Exemplar(G1, G2) are depicted as those three fold lines.

bounded to be 90%, 80%, and 70% of all in S, the average running times of Exemplar(G1, G2) are depicted in Figure 2. Each data point in Figure 2 comes from the average running time of 20 samples on a personal computer of Core(TM) i5-2400 CPU and 4G memory. In our experiment for recording the running time of Exemplar(G1, G2), the algorithm can always accomplish its computation within 1000 seconds for genomes of 5000 gene families, where 90% of them are nontrivial. Then, fixing the number of gene families in S, we observe the running time increase rates of Exemplar(G1, G2), along with the span increases of G2. For the number of gene families fixed to be 1000, as well as G2 with spans from 6 to 15, the average running times of Exemplar(G1, G2) are depicted in Figure 3. Each data point in Figure 3 comes also from the average running time of 20 samples on a personal computer of Core(TM) i5-2400 CPU and 4G memory. For all our running time recordings of Exemplar(G1, G2), the algorithm can always accomplish its computation within 1600 seconds for genomes with 1000 gene families. Particularly, we stay and observe the situation when G2 has few and far between nontrivial genes. Let G1 and G2 both be on a set of 1000 gene families. For the span of G2 going from 21 to 75, the average running times of Exemplar(G1, G2) are depicted in Figure 4. Each data point in Figure 4 comes from the average running time of 20 samples on the same computer as that for Figure 2. Recall that the total running time of our algorithm is O(4sn2). So naturally, Figure 3 implies that the running time of Exemplar(G1, G2) will increase dramatically with the span increases of G2, if G2 has plenty of nontrivial genes. By Figure 4, moreover, we notice that Exemplar(G1,G2) has a good behavior

FIG. 5. The maximum number of exemplars stored in running Exemplar (G1, G2) as a function of exemplar breakpoint distance between G1 and G2. The data points come from 100 pairs of genome samples with 5000 gene families occurring in G1 as well as G2, and 8500 genes in G2, where the span of G2 is 10.

10

WEI ET AL.

FIG. 6. The maximum number of exemplars stored in running Exemplar (G1, G2) as a function of exemplar breakpoint distance. The data points come from 100 pairs of genome samples with 1000 gene families occurring in G1 as well as G2, and 1700 genes in G2, where the span of G2 is 15.

for those genomes with span larger than 20 but few nontrivial genes. Actually, the distribution of those nontrivial genes in G2 becomes the most important factor to influence the behavior of Exemplar(G1, G2), if G2 has few nontrivial genes. For example, if G2 has span 36, and 20% of its genes are nontrivial, then Exemplar(G1, G2) can finish for finding the minimum exemplar of G2 in 60 seconds, if those nontrivial genes in G2 are distributed uniformly. On the other hand, if those nontrivial genes accumulate together in G2, the computation of Exemplar(G1, G2) may go out of the memory bound of the computer. The same behavior of Exemplar(G1, G2) happens to those situations where G2 has 10 and 30 percent of nontrivial genes. Finally, we pay attention to those pairs of genomes generated by RandData (S, s, p), as well as their exemplar breakpoint distance distributions, and the maximum memory spaces used for computing their exemplar breakpoint distances. Note that the cardinality of the minimum s-core of E(G2[1, i]) can represent the memory space used for computing the exemplar breakpoint distance between G1 and G2. Based on r pairs of randomly generated genomes, say Gj1 ‚ Gj2 ‚ 1pjpr‚ we can set a figure of r data points with the exemplar breakpoint distances of Gj1 and Gj2 as their horizontal coordinates, and the maximum number of exemplars in the minimum s-cores of E(Gj2 [1‚ i]) as their vertical coordinates, 1 £ j £ r. We select two groups of genome pairs, one of which has those nontrivial genomes with span 10, the other of which has those nontrivial genomes with span 15, to observe the exemplar breakpoint distance distributions and the memory spaces for computing the exemplar breakpoint distances. The data points in Figure 5 come from running Exemplar(G1, G2) on a personal computer of Core(TM) i5-2400 CPU and 4G memory for 100 pairs of genomes, say Gj1 ‚ Gj2 ‚ 1pjp100, where Gj1 = Gj1 [1‚ 5000] is trivial, Gj2 = Gj2 [1‚ 8500] is nontrivial and has span 10. The data points in Figure 6 come from running Exemplar(G1, G2) on the same computer as before for 100 pairs of genomes, say Gj1 ‚ Gj2 ‚ 1pjp100, where Gj1 = Gj1 [1‚ 1000] is trivial, Gj2 = Gj2 [1‚ 1700] is nontrivial, and has span 15.

5. CONCLUSIONS By a dynamic programming approach, we have presented a parameterized algorithm for EBD(1, 2). This algorithm can run in O(4sn2) time and O(4sn) space to solve an EBD(1, 2) instance with a trivial genome and a genome of span s, on a set of n gene families. Actually, this article paid no attention to those genomes with signed genes. Of course, our algorithm is available if the genomes in EBD(1,2) are signed, where the complexity analysis is slightly different. Moreover, our algorithm can also be used to compute the maximum adjacencies between two genomes. Since we have not come across any EBD(1, 2) instance that can take O*(4s) running time of our algorithm, it is interesting to give a better complexity analysis for this algorithm. It is also interesting to know if one can follow the method in this article to solve EBD(1, q) (q > 2) by restricting the span of that nontrivial genome.

ALGORITHM FOR (1,2)-EXEMPLAR BREAKPOINT DISTANCE

11

ACKNOWLEDGMENTS Daming Zhu is supported by (1) National Natural Science Foundation of China: 61472222 and (2) Natural Science Foundation of Shandong Province: ZR2012Z002. Lusheng Wang is supported by a GRF grant from Hong Kong SAR government Project No. [CityU 123013].

AUTHOR DISCLOSURE STATEMENT No competing financial interests exist.

REFERENCES Alon, N., Yuster, R., and Zwick, U. 1995. Color coding. J. ACM 42, 844–856. Angibaud, S., Fertin, G., and Rusu, I. 2008a. On the approximability of comparing genomes with duplicates. Lect. Notes Comput. Sci. 4921, 34–45. Angibaud, S., Fertin, G., Rusu, I., et al. 2008b. Efficient tools for computing the number of breakpoints and the number of adjacencies between two genomes with duplicate genes. J. Comput. Biol. 15, 1093–1115. Blin, G., Chauve, C., Fertin, G., et al. 2007. Comparing genomes with duplications: a computational complexity point of view. IEEE/ACM Trans. Comput. Biol. Bioinform. 4, 523–534. Blin, G., Fertin, G., Sikora, F., et al. 2009. The exemplar breakpoint distance for nontrivial genomes cannot be approximted. Lect. Notes Comput. Sci. 5431, 357–368. Bonizzoni, P., Vedova, G.D., Dondi, R., et al. 2007. Exemplar longest common subsequence. IEEE/ACM Trans. Comput. Biol. Bioinform. 4, 535–543. Bryant, D. 2000. The complexity of calculating exemplar distances, 207–212. In Sankoff, D., and Nadeau, J.H., eds. Comparative Genomics: Empirical and Analytical Approaches to Gene Order Dynamics, Map Alignment, and the Evolution of Gene Families. Kluwer, Amsterdam. Bulteau, L., and Jiang, M. 2012. Inapproximability of (1,2)-exemplar distance. Lect. Notes Comput. Sci. 7292, 13–23. Chen, Z., Fu, B., and Zhu, B. 2006. The approximability of the exemplar breakpoint distance problem. Lect. Notes Comput. Sci. 4041, 291–302. Dennis, M.Y., Nuttle, X., Sudmant, P.H., et al. 2012. Evolution of human-specific neural SRGAP2 genes by incomplete segmental duplication. Cell 149, 912–922. Fu, B., and Zhang, L. 2011. A polynomial algebra method for computing exemplar breakpoint distance. Lect. Notes Comput. Sci. 6674, 297–305. Jiang, M. 2011. The zero exemplar distance problem. J. Comput. Biol. 18, 1077–1086. Nguyen, C.T., Tay, Y.C., and Zhang, L. 2005. Divide-and-conquer approach for the exemplar breakpoint distance. Bioinformatics 21, 2171–2176. Sankoff, D. 1999. Genome rearrangement with gene families. Bioinformatics 15, 909–917. Shi, G., Zhang, L., and Jiang, T. 2010. MSOAR 2.0: Incorporating tandem duplications into ortholog assignment based on genome rearrangement. BMC Bioinform. 11, 1–14. Zhu, B. 2009. Approximability and fixed-parameter tractability for the exemplar genomic distance problems. Lect. Notes Comput. Sci. 5532, 71–80. Zhu, D., and Wang, L. 2013. An exact algorithm for the zero exemplar breakpoint distance problem. IEEE/ACM Trans. Comput. Biol. Bioinform. 10, 1469–1477.

Address correspondence to: Prof. Lusheng Wang Department of Computer Science City University of Hong Kong 83 Tat Chee Avenue Kowloon, Hong Kong People’s Republic of China E-mail: [email protected]

A Dynamic Programming Algorithm For (1,2)-Exemplar Breakpoint Distance.

The exemplar breakpoint distance problem is motivated by finding conserved sets of genes between two genomes. It asks to find respective exemplars in ...
1MB Sizes 1 Downloads 7 Views