
Demystifying Computer Science for Molecular Ecologists

Mahdi Belcaid* and Robert J. Toonen
Hawaii Institute of Marine Biology, University of Hawaii

*Electronic address: belcaid@hawaii.edu

Abstract

In this age of data-driven science and high-throughput biology, computational thinking is becoming an increasingly important skill for tackling both new and long-standing biological questions. However, despite its obvious importance and conspicuous integration into many areas of biology, computer science is still viewed as an obscure field that has, thus far, permeated only a few of the biology curricula across the nation. A national survey has shown that a lack of computational literacy in environmental sciences is the norm rather than the exception (Valle & Berdanier, 2012). In this paper, we introduce a few important concepts in computer science in a context tailored to research biologists. Our goal is to help biologists understand some of the most important mainstream computational concepts in order to better appreciate bioinformatics methods and the trade-offs that are not obvious to the uninitiated.

1 Introduction

Next-Generation Sequencing (NGS) technologies have produced a substantial decrease in the cost and the complexity of generating sequence data, and are allowing researchers to tackle questions that could not previously be addressed. Along with this remarkable progress in data acquisition, parallel advances in the computational sciences, such as machine learning and high performance computing,
are enabling researchers to answer complex biological questions using creative computational and quantitative techniques. For example, it is now possible to identify germline and somatic variants in thousands of individuals at reasonable cost, by employing powerful algorithms to analyze datasets generated using reduced-representation methods such as RAD-seq (Marx, 2013; Pabinger et al., 2014). The scale and the complexity of these new problems require computational infrastructures beyond those typically available in a traditional molecular ecology lab, and the computational background required to mine these datasets can be a major handicap. As such, a basic understanding of computer science is becoming an essential part of training the next generation of data-enabled biologists, not only as a tool during the inevitable integration of computer science into biology, but also to enable more productive interactions in the new era of multidisciplinary and large-scale biology. While an increasing number of undergraduate and graduate programs are including bioinformatics, programming or computer science in their curricula, precious few students seem to understand the principal computational concepts underlying the tools they use regularly in their research (Pevzner et al., 2009; Valle & Berdanier, 2012).

Computer science is a discipline that lays the theoretical foundations for the systematic study of processes that describe and transform information (Brookshear & Brookshear, 2003). Its applications are widespread and encompass, among other topics, the design of computer systems, algorithms, artificial intelligence and data mining. In recent years, computer science has carved out an important niche in biology by contributing the theoretical foundations required to answer fundamental biological questions. This tight integration of computer science into biology has in fact led to the creation of a new field of research called bioinformatics, which consists of the application of informatics techniques, drawn from applied mathematics, computer science and statistics, to organize and understand the information associated with biological molecules (Luscombe et al., 2001).

Here, we provide an introduction to computer science to help molecular ecologists understand the conceptual underpinnings of the field, the basic approaches used in bioinformatics and some of the trade-offs that different approaches entail. We divide this review into major sections as follows. In the first sections, we introduce two important algorithm design techniques and discuss how their efficiency, or lack thereof, can be analyzed using simple concepts from complexity theory. In Section 4, we introduce two data structures, namely suffix trees and graphs, and demonstrate how they can be used to design better algorithms. In Section 5, we briefly discuss the major distinctions between programming and scripting languages, and in Section 6 we define some important High Performance Computing concepts. In Section 7, we introduce machine learning, a topic that is gaining increasing popularity in biology. We conclude in the final section with an original example of how computational theory can be applied to yield an approximation of an important compute-intensive task, namely sequence clustering. Throughout, we use restriction site-associated DNA sequencing (RAD-seq) as an example to highlight the subjects covered in this review. At the most basic level, variant calling from RAD data is simply a matter of: 1- initial quality control; 2- inference of loci; 3- output formatting; and 4- SNP calling from the raw sequencing reads (see Figure 1).


[Figure 1 (schematic): the RAD-seq variant-calling pipeline, from input fastq files through 1- quality control (low-quality trimming, barcode removal, dereplication, etc.), 2- clustering into loci and consensus sequences, 3- output formatting and 4- SNP calling into output VCF files. Panels map each stage to the concepts covered here: algorithms and complexity, brute force vs. heuristic approaches (Sections 2 and 3); data structures, trees and graphs (Section 4); compiled vs. interpreted languages (Section 5); high performance computing, task vs. data parallelism (Section 6); and computational thinking, machine learning, PCA and random projections (Sections 7 and 8).]

Fig. 1: Variant calling from RAD-seq data is divided into four high-level steps: 1- quality control, 2- loci inference, 3- output formatting and 4- SNP calling. These steps require familiarity with different computational concepts, which we discuss throughout this paper.


These steps are carried out consecutively as a pipeline, where the output of one program becomes the input of the next. However, the details of how to perform this relatively simple task become overwhelming for the large datasets returned by next-generation sequencers, and assessing the feasibility and the amount of time required for each step requires familiarity with the subjects covered in Sections 2 through 8. Because this review is aimed at molecular ecologists rather than those trained in bioinformatics, we favor intuitive examples over formal mathematical rigor to facilitate understanding of the theoretical concepts in this text. Further, our examples focus primarily on sequence data, with which we believe molecular ecologists are intimately familiar, but the concepts and approaches introduced herein are not domain specific.

2 What is an Algorithm

Algorithms represent the most fundamental concept in computer science and are routinely used in bioinformatics to gain important biological insight (Leiserson et al., 2001). Similar to recipes, algorithms unambiguously describe methods for solving well-defined problems using an enumerable set of steps (Gusfield, 1997). The operating principle of an algorithm is described using a rigorous high-level language termed pseudocode. Using a mixture of keywords and prose, pseudocode describes the logic required to solve a problem, and is intended solely for human use rather than for direct execution by a computer. As an illustration, the pseudocode in Figure 2 describes a function, or collection of instructions, for computing the percent GC content of a DNA sequence. For convenience, we will call this function count_p_GC. The algorithm iteratively inspects each base of the input sequence and counts the number of guanine (G) and cytosine (C) nucleotides it encounters. The final count is then converted into a percentage and printed. The header of the algorithm (line 1) defines the function name and its input, i.e., the sequence for which the percent GC will be computed. Lines 2 and 3 introduce two variables, or storage bins, that we use to record the cumulative length of the sequence and the GC count as each base is inspected. For clarity, we name these variables gc_counter and sequence_length. Line 4 indicates the beginning of a "for" loop, a control structure enclosing instructions that are executed iteratively; in this case, lines 5 through 7 are executed for each base of the input sequence, with line 5 incrementing the sequence_length variable at each iteration. Line 6 introduces another control structure, an "if" clause, which tests whether the base is either a G or a C and, if so, increments gc_counter (line 7). In line 8, the content of gc_counter is converted into a percentage and then printed. The table in Figure 2 shows the execution trace for the sequence ACGGATAACG.


1: function count_p_GC(sequence)
2:     gc_counter ← 0
3:     sequence_length ← 0
4:     for each base in sequence do
5:         sequence_length ← sequence_length + 1
6:         if base equals G or base equals C then
7:             gc_counter ← gc_counter + 1
8:     print (gc_counter ∗ 100) / sequence_length

Fig. 2: Algorithm for computing the GC content of a sequence. At each iteration, a base of the sequence is inspected and the variables are updated. The table shows an example execution trace for the sequence ACGGATAACG: the values of the variables sequence_length and gc_counter are shown for each iteration of the for-loop. The final result is computed after all the bases of the sequence have been inspected.
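For readers who prefer runnable code, a minimal Python translation of the pseudocode in Figure 2 might look as follows; the function and variable names simply mirror those used above, and the snippet is illustrative rather than a production tool.

# A minimal Python translation of the pseudocode in Figure 2; variable names
# mirror those used in the text.

def count_p_GC(sequence):
    gc_counter = 0
    sequence_length = 0
    for base in sequence:                # inspect each base in turn
        sequence_length += 1
        if base == "G" or base == "C":   # count guanines and cytosines
            gc_counter += 1
    print(gc_counter * 100 / sequence_length)

count_p_GC("ACGGATAACG")   # prints 50.0 (5 of the 10 bases are G or C)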

3 Problem Solving Techniques and Complexity Analysis

Once a computational problem has been formally defined, it is common practice to resort to a set of systematic strategies to solve it. In what follows, we present two of the most important such strategies, or algorithm design paradigms (see Definitions Box), and explicitly showcase the importance of a good paradigm in the overall efficiency of the solution. These systematic strategies are the brute force and the greedy approaches. The brute force approach, while rarely the most efficient, is often used to better understand the question as well as the structure of the solution space. In contrast, the greedy strategy often yields an acceptable solution while making more efficient use of available resources, but at the expense of not exploring all possible solutions to determine which is best.

3.1 Brute Force Algorithms

Brute force algorithms exhaustively enumerate all possible solutions to select the one that best satisfies the problem statement. This class of algorithms is usually the easiest to design and, when tractable, is guaranteed to produce an exact solution. For example, let us consider the brute force solution to a basic challenge: the exact string-matching problem. This problem finds the position(s) of a string, or specific sequence of characters, within a longer text. Exact string matching can, for example, be used to find the occurrence of a primer in a fasta file (see Definitions Box), or to search for the presence of a gene name in a collection of scientific articles.



Fig. 3: Exact string searching in the worst-case scenario. The query sequence S is compared to the text T by iteratively shifting S one position at a time. If S does not occur in T, a total of M × (N − M + 1) comparisons are carried out.

Formally, we define the exact matching problem as searching for a string S of length M characters in a longer text T of length N characters (Gusfield, 1997). In this context, a brute force solution consists of iteratively comparing the string S to every substring of size M in the text T. Using pseudocode, a possible brute force search algorithm is described in Algorithm 1.

Algorithm 1
1: function string_match_brute_force(S, T)
2:     for each position i, ranging from 1 to (N − M + 1) in T do
3:         if the substring of T spanning positions i to i + M − 1 matches the pattern S then return position i
4:     ▷ Continue if an exhaustive search is desired, or else stop

As illustrated in Figure 3, in the worst-case scenario, which occurs when the string is not present in the text, the number of positions inspected in the text T is N − M + 1. Furthermore, given that M comparisons are required at each position, the total number of comparison operations carried out is M × (N − M + 1). Note that we could bypass the remaining comparisons in S and move to the next iteration as soon as a mismatch is detected. However, we chose not to do so in the algorithm in order to simplify the task of counting the total number of comparisons.
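A direct Python rendering of Algorithm 1 might look as follows; positions are reported 1-based, as in the text, and the example pattern and text are arbitrary.

# A minimal sketch of the brute force exact-matching search of Algorithm 1;
# it returns every position (1-based, as in the text) where S occurs in T.

def string_match_brute_force(S, T):
    M, N = len(S), len(T)
    positions = []
    for i in range(N - M + 1):          # candidate start positions in T
        if T[i:i + M] == S:             # up to M character comparisons
            positions.append(i + 1)     # report the 1-based position
    return positions

print(string_match_brute_force("ACT", "ACGTACTGACT"))   # prints [5, 9]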

3.2 Greedy Algorithms

As opposed to brute force exhaustive searches, greedy algorithms do not explore the set of all possible solutions. Rather, they make arbitrary, but intuitively "good", choices at each stage, with the goal of obtaining an optimal, or at least satisfactory, solution at the end. A greedy approach is most often employed when an optimal solution is intractable, i.e., when it would take far too long or too much computing power for the program to complete. To illustrate how greedy algorithms work, we introduce a problem of seminal importance in bioinformatics: the Shortest Common Superstring problem (Pevzner & Waterman, 1995).


[Figure 4: (a) original genome sequence ACGTATGATTAGCG and read set R = {ACGTA, ATGAT, ATTAG, TAGCG, GTATG, GATTA}; (b) concatenation superstring ACGTAATGATATTAGTAGCGGTATGGATTA (30 bases); (c) shortest superstring ACGTATGATTAGCG (14 bases).]

Fig. 4: Example of the shortest common superstring problem. (a) A dataset of six short reads, each five bases in length. (b) The concatenation of all reads results in a contig (underlined) of 30 bases. (c) However, the shortest superstring (underlined) containing all the reads from R is only 14 bases long.

This problem is a simple and naive approximation of the DNA sequence assembly process (Pevzner & Waterman, 1995): given a set of short sequencing reads, and assuming no systematic errors, we need to find the shortest genomic sequence from which these reads were derived (see Figure 4). We formally define the problem as follows: given a set R of n sequencing reads r1, r2, r3, ..., rn, find the shortest genomic sequence G such that each of the sequencing reads in R appears as a substring of G. Note that by constraining the common superstring to be the shortest possible, we prevent the trivial concatenation of all elements in R from being a valid solution (Figure 4b). A greedy solution to this problem consists of iteratively selecting and merging the pair of reads sharing the highest overlap similarity. The merged strings are replaced in R by their consensus, and the process is repeated until no strings in R can be merged (Ma, 2008). The pseudocode for the greedy Shortest Common Superstring algorithm is given in Algorithm 2. We compare the trade-offs, in terms of time required, between the brute force and greedy solutions to this problem in the Algorithmic Complexity section below.

Algorithm 2
1: function greedy_SCS(R)
2:     while there is more than one read in R do
3:         Find the pair of reads ri and rj with the highest overlap similarity
4:         Add the consensus of ri and rj to R
5:         Remove ri and rj from R
    return the read remaining in R
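As a sketch of how Algorithm 2 might be implemented, the following Python code greedily merges the pair of reads sharing the largest exact suffix-prefix overlap. The overlap and merge rules are deliberately simplified (exact matches only, no sequencing errors), so this is an illustration rather than a production assembler.

# A minimal sketch of the greedy shortest-common-superstring strategy of
# Algorithm 2, assuming error-free reads; names are illustrative only.

def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_scs(reads):
    """Iteratively merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_pair = -1, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i == j:
                    continue
                olen = overlap(a, b)
                if olen > best_len:
                    best_len, best_pair = olen, (i, j)
        i, j = best_pair
        merged = reads[i] + reads[j][best_len:]   # consensus of the two reads
        reads = [r for k, r in enumerate(reads) if k not in (i, j)]
        reads.append(merged)
    return reads[0]

R = ["ACGTA", "ATGAT", "ATTAG", "TAGCG", "GTATG", "GATTA"]
print(greedy_scs(R))   # a short superstring containing every read in R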


Fig. 5: (1) Unrooted phylogenetic tree for the set of species X = {a, b, c, d, e, f, g, h}. (2) The subtree with the best phylogenetic diversity score for a subset W = {c, f, h} of size 3.

Greedy algorithms have been successfully applied in numerous ecological applications, such as conservation planning (Steel, 2005). Figure 5 shows an unrooted phylogenetic tree T, whose leaves comprise a set X of species and whose edges are assigned non-negative phylogenetic distances. The phylogenetic diversity (PD) of a subset of species W from the set X is computed as the sum of the edges that connect the species in W. For instance, the PD of the subset of species W = {a, e, f} from Figure 5 is 1 + 5 + 2 + 7 = 15. Given a set X of species that one could conserve from the tree T, and in the presence of conservation constraints such as financial cost, survival rates or environmental impacts, one common way to select the best n candidates for conservation is to choose those that are maximally distinct, i.e., those that maximize the PD score. A brute force solution to this conservation planning problem consists of computing the PD score of all subsets of the desired size n. This rather trivial approach is computationally expensive and cannot scale to large sets X and W. A more efficient way of solving the problem is the following greedy approach: start by selecting the pair of most phylogenetically distant taxa, and then sequentially add the taxon that maximizes the PD score until the desired set size is attained. This solution is not only easier to implement than the brute force method, but it can also scale to very large sets X and W.
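As an illustrative sketch of this greedy strategy, the following code uses a matrix of pairwise phylogenetic distances as a simplified stand-in for the tree-based PD score (the true PD sums edges of the tree rather than pairwise distances); the taxa and distances below are invented.

# A simplified sketch of the greedy selection strategy described above, using
# pairwise distances as a proxy for the tree-based PD score.

import itertools

def greedy_select(taxa, dist, n):
    """Greedily pick n taxa: start from the most distant pair, then add,
    at each step, the taxon farthest from the current selection."""
    selected = set(max(itertools.combinations(taxa, 2),
                       key=lambda p: dist[p[0]][p[1]]))
    while len(selected) < n:
        best = max((t for t in taxa if t not in selected),
                   key=lambda t: sum(dist[t][s] for s in selected))
        selected.add(best)
    return selected

taxa = ["a", "b", "c", "d"]
dist = {"a": {"b": 6, "c": 13, "d": 12},
        "b": {"a": 6, "c": 11, "d": 10},
        "c": {"a": 13, "b": 11, "d": 5},
        "d": {"a": 12, "b": 10, "c": 5}}
print(greedy_select(taxa, dist, 3))   # a set of 3 maximally distant taxa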


The class of algorithms that search only a subset of all possible solutions is called heuristics, and such algorithms are often used as an alternative when an exact solution is not tractable. Heuristics cannot guarantee that the best solution will be found, and propose an approximate solution instead. It turns out that, due to their complexity, many bioinformatics problems must be solved using heuristics. For example, sequence assembly (Parsons et al., 1993), multiple sequence alignments (Zhang et al., 2000) and maximum likelihood tree reconstruction (Korostensky & Gonnet, 2000), to list only a few, all rely on heuristics to yield satisfactory solutions in a reasonable time. A satisfactory, or acceptable, solution is problem specific. For example, the greedy shortest common superstring algorithm has been proven to provide an approximation factor of 4; in other words, the produced superstring is at most 4 times longer than the best possible solution. In practice, the greedy solution likely yields an approximation factor of 2. Of course, such an approximation factor can still be unacceptable for some problems, but variants of the greedy solution to the shortest common superstring, as implemented in CAP3 (Huang & Madan, 1999) or PHRAP (Bastide & McCombie, 2007), have been used to successfully reconstruct a range of genomes. Thus, despite being approximate, some greedy algorithms can produce a satisfactory solution using fewer resources and, as we shall see in the next section, within far shorter time frames. Furthermore, these solutions are intuitive and often easier to design and implement than more sophisticated strategies.

3.3 Algorithmic Complexity

Generally speaking, the number of operations carried out by an algorithm is an important estimate of its efficiency and, indirectly, of its run time. For the exact matching problem, we determined that both the length of the text and that of the pattern influence the number of operations executed, which we can approximate as N × M. For example, in the worst-case scenario, searching for a string of 100 bases in a DNA text of 100,000 bases requires approximately 100,000 × 100 = 10,000,000 comparisons. If we suppose that each comparison takes one millionth of a second, or 10^-6 s, then the search will take about 10 s to complete. If the same 100 bp string is searched against the human genome, which comprises approximately 3 billion bases, the brute force search algorithm will take 30,000 times as long, or roughly 3.5 days, to complete. This simple way of analyzing the brute force algorithm is a convincing argument of its inefficiency on large datasets, such as for mapping millions of reads onto a reference genome. Similar reasoning is commonly employed in the bioinformatics literature to carry out what is referred to in computer science as complexity analysis. The results of such analyses summarize, often using simple formulae, how different algorithms scale to handle large datasets. For example, Edgar (Edgar, 2004) approximates the total runtime complexity of the algorithm implemented in ClustalW, one of the most commonly used multiple sequence alignment programs, as N^4 + L^2, where N is the total number of sequences and L is their length. Thus, for 10 sequences, each of 400 bases, the total number of operations can be approximated as 10^4 + 400^2, and increasing the number of sequences by a factor of 10 increases the N^4 term by a factor of 10,000. In (Morgenstern, 1999), the authors approximate the complexity of the algorithm implemented in the multiple sequence alignment program DIALIGN as N^4 × L^2. Therefore, one can expect that DIALIGN would be roughly 10,000 times slower than ClustalW in the worst-case scenario for the example above. Of course, the runtime alone is
not indicative of the performance of these programs, and there may be important biological or accuracy reasons to select a more complex, and hence slower, algorithm over a simpler, faster one. The computational complexity of the greedy Shortest Common Superstring algorithm can be derived using similar reasoning. The number of comparisons carried out by the algorithm is a function of the number and the length of the sequencing reads in R. In the first iteration, selecting the pair with the best overlap requires carrying out all possible pairwise comparisons: r1 versus r2, r1 versus r3, and so on. Therefore, to compare n reads, a total of (n − 1) + (n − 2) + ... + 1 = n(n − 1)/2 comparisons are required. Each comparison requires aligning two reads of average length M, which can be done using a pairwise sequence alignment algorithm (Smith & Waterman, 1981) in approximately M^2 operations in the worst case (Jones & Pevzner, 2004). Therefore, M^2 × n(n − 1)/2 operations are required in the first iteration alone. By applying the same reasoning to the remaining iterations, the approximate total number of operations for the greedy Shortest Common Superstring is then M^2 × (n^3 − n). With the exhaustive search strategy, the shortest common superstring requires exploring all possible permutations (ordered overlaps) of the strings before selecting the one that yields the shortest string. This approach, while guaranteed to generate the best solution, requires a number of operations that grows as the factorial of the number of sequencing reads (n!). Thus, for 100 reads, the number of operations would approach 100!, or about 9 × 10^157. In fact, the exhaustive search is impractical for even 100 reads, and impossible for genome assemblies. We will discuss alternative solutions to this problem and the associated trade-offs in the next section. It is clear from the earlier examples that the computational complexity is a good indicator of the efficiency of the algorithm it describes. In essence, an algorithm whose cost can be described by a polynomial function, such as n^5, is still preferable to one described by a higher-order polynomial, e.g., n^10, or an exponential function, such as 2^n. As an illustration, while the polynomial example n^10 takes 10^10 operations to process an input of size n = 10, the non-polynomial example 2^n takes 2^100, or roughly 20 orders of magnitude more operations, for an input of size n = 100 (see Figure 6). By and large, problems and algorithms are often classified in the literature as polynomial or non-polynomial, with the understanding that algorithms in the non-polynomial class are not tractable (for more details, see the definition of "NP-complete" in Section 10). In addition to the brute force and greedy paradigms described here, many other paradigms (e.g., dynamic programming (Eddy, 2004), backtracking (Siragusa et al., 2013), etc.) can be used to solve a problem. However, using computational complexity, we can often show that one strategy is more efficient than the others, thereby justifying its selection over competing solutions. In some cases, it is preferable to recast the problem in a form that yields an improvement in computational complexity. This recasting often requires the use of data structures, which we discuss next.


[Figure 6 (plot): Function Growth as a Function of the Problem Size; approximate number of operations (y-axis, 10^0 to 10^18, log scale) versus problem size (x-axis, 0 to 20) for the functions x^5, x^10, 2^x and x!.]

Fig. 6: Illustration of the growth rates of the functions x^5, x^10, 2^x and x!. A function describing the computational complexity of an algorithm is a good indicator of how that algorithm scales to handle large inputs. As the problem size grows (x axis), the number of operations carried out by an algorithm with an exponential cost (such as 2^x) or a factorial cost (x!) grows much faster than for algorithms with polynomial costs such as x^5 or x^10. For instance, with a problem of size 16, an exponential algorithm with base 5 requires more than 5^16, or more than 152 billion, operations, whereas the algorithm with complexity x^5 requires approximately 1 million operations for the same input size.
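The growth rates in Figure 6 are easy to reproduce: the short snippet below tabulates the four functions for a few problem sizes using only the Python standard library, and is intended purely as an illustration.

# Tabulate the growth of x^5, x^10, 2^x and x! for a few problem sizes.

import math

for size in (5, 10, 15, 20):
    print(f"size={size:>2}  "
          f"x^5={size**5:.1e}  "
          f"x^10={size**10:.1e}  "
          f"2^x={2**size:.1e}  "
          f"x!={math.factorial(size):.1e}")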


4 Data Structures

A data structure defines the format under which data are stored and organized. In addition to providing efficient means for managing massive amounts of data, appropriate data structures often allow for the design of more efficient algorithms. To illustrate how data can be organized using a data structure, we revisit the exact string matching problem to show how a simple reorganization of the text can substantially speed up the search, and explain why different data structures are required by different programs or applications. The data structure we will use to store the reorganized text is called a suffix tree. It is a structure in which data are organized hierarchically, in a manner similar to phylogenetic trees. Suffix trees store information about all the suffixes of the text, i.e., the strings that end the text (Figure 7a). Specifically, the concatenation of the edge labels between the root and a branch tip, or leaf, spells a suffix, and each leaf holds the position of the suffix spelled by the labels on its path from the root (Figure 7b). Once the suffix tree is built, the exact matching problem becomes trivial: it requires only starting at the root node and following the pattern along the edges of the tree. The position(s) of the pattern in the text are contained in the leaf or leaves at or below the last edge to which the pattern matched (see Figure 7b for an example). The construction of the suffix tree can be done off-line in a non-prohibitive amount of time (e.g., minutes or hours). This data structure transforms the matching problem into a simple threading of the pattern along the edges of the tree. This operation requires, regardless of the size of the text being searched, at most M comparisons (the length of the pattern) instead of the N × M comparisons required by the brute force algorithm. For example, searching for the exact occurrence of a short sequencing read of 100 bases would require at most 100 comparisons. In fact, this data structure is used by MP-SCAN (Rivals et al., 2009) to map millions of short reads generated by next-generation sequencers against a reference genome. That said, selecting an appropriate data structure necessitates a critical analysis of the trade-offs involved. For example, while a suffix tree achieves substantial gains in terms of speed, its high memory requirements can limit its use to relatively small datasets, on the order of a few gigabytes in size.

Similar to trees, graphs are another well-studied data structure, and one that is particularly relevant in bioinformatics and computational biology. Graphs, often referred to as networks by molecular ecologists, are a fundamental computer science concept used to represent structured information. Inherently, a graph depicts a set of items, known as vertices (also called nodes), and the relationships between pairs of these vertices, known as edges (Heineman et al., 2009). Figure 8 shows an example of a graph representing the dispersal between fish populations (circles) in the study landscape, as estimated from the variability of a DNA marker. Vertices are labeled with the locations where the populations were surveyed, and directed edges are labeled with the number of individuals that leave a population to join another. This graph highlights high migration within the sub-regions and very low migration across the sub-regions. It is important to remember that this graph does not tell us how the migration rates were computed: it is merely a representation of the underlying structure of the data.
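As a runnable illustration of index-based search, the sketch below uses a suffix array, i.e., a sorted list of suffixes, which is a close relative of the suffix tree (here built naively, so it is memory-hungry and only suitable for short texts); the text is the one from Figure 7.

# A minimal sketch of index-based exact matching with a naive suffix array;
# storing whole suffixes is wasteful and is done here only for clarity.

from bisect import bisect_left

def build_suffix_array(text):
    """All (suffix, 1-based start position) pairs, sorted by suffix."""
    return sorted((text[i:], i + 1) for i in range(len(text)))

def search(sa, pattern):
    """Return the 1-based positions of every exact occurrence of pattern."""
    suffixes = [s for s, _ in sa]
    k = bisect_left(suffixes, pattern)      # first suffix >= pattern
    hits = []
    while k < len(sa) and suffixes[k].startswith(pattern):
        hits.append(sa[k][1])
        k += 1
    return sorted(hits)

text = "ATGCATACATGG"
sa = build_suffix_array(text)               # built once, off-line
print(search(sa, "TGG"))                    # prints [10]
print(search(sa, "AT"))                     # prints [1, 5, 9]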


[Figure 7: (a) all possible suffixes of the string ATGCATACATGG and their positions (table below); (b) the corresponding suffix tree (drawing not reproduced here).]

Suffix          Position
G               12
GG              11
TGG             10
ATGG             9
CATGG            8
ACATGG           7
TACATGG          6
ATACATGG         5
CATACATGG        4
GCATACATGG       3
TGCATACATGG      2
ATGCATACATGG     1

Fig. 7: All possible suffixes of the string ATGCATACATGG, as well as their positions, are listed in table (a). The threading of the pattern TGG through the suffix tree is represented by the black nodes. The position of the pattern is indicated by the value at the leaf where the threading ends, i.e., 10. Alternatively, the threading of the pattern AT is represented using the striped nodes. The positions of the pattern in the string are indicated by its descendant leaves, i.e., 1, 9 and 5.

The molecular ecology literature abounds with applications of graphs. Such applications range from species connectivity (Moilanen & Nieminen, 2002) to phylogenetic tree reconstruction (Pevzner, 2000) and genome assembly (Parsons et al., 1993; Kececioglu & Myers, 1995). In DNA sequencing, graphs allow us to conveniently recast the shortest superstring problem into one of the most studied problems in computer science: the Traveling Salesman problem. The question this problem tries to answer is simple: given a list of cities (represented by vertices) and the distances between them (stored on the edges), what is the shortest possible route that visits each city exactly once before returning to the origin? This problem has numerous practical real-life applications, such as routing packets over the Internet and trip scheduling for airline carriers (Ma, 2008). In the case of the DNA shortest common superstring, we build the graph by representing each sequencing read as a vertex and connecting each overlapping pair of sequences with an edge. Each edge is labeled by the overlap length between the pair of reads that it connects. The shortest superstring problem can then be reformulated as finding the path in the graph that visits each vertex exactly once and for which the sum of the weights over its edges is maximal. This definition differs only slightly from the Traveling Salesman problem in that the path, or route, through the vertices need not return to the vertex of origin (Figure 9). The shortest path for the example dataset, highlighted in red in the figure, describes the order of the overlaps that generate the shortest common superstring, or assembly of reads.


[Figure 8: dispersal graph with vertices labeled South_O, East_M, West_H, North_H, East_H, South_K, South_M and South_H, connected by weighted directed edges; drawing not reproduced here.]

Fig. 8: An example of a graph representing the dispersal between fish populations (circles) in the study landscape, as estimated from the variability of a DNA marker.


Fig. 9: Shortest Common Superstring graph. The shortest path, highlighted in red, describes the order of the overlaps that generate the shortest common superstring. The edges are labeled by the number of overlapping DNA bases between two vertices (sequences).

The first vertex in the shortest path (labeled Start in the figure) is ACGTA, and the second vertex in the path is GTATG. Both sequences overlap by 3 bases (GTA), and their consensus is ACGTATG. By iteratively building the consensus along all the edges in the shortest path of the graph, we obtain the sequence ACGTATGATTAGCG as a possible shortest superstring. A large number of heuristics for solving the Traveling Salesman problem have been proposed (Laporte, 2010). Some not only have a better computational complexity, but have also been proven mathematically to generate solutions closer to the optimum than would be expected from the greedy assembly strategy.
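To make this recasting concrete, the following sketch builds the overlap graph implicitly and exhaustively scores every ordering of the reads; as discussed in Section 3.3, such an exhaustive search is tractable only for a handful of reads (n! orderings). The read set is the one from Figure 4.

# A sketch of the graph recasting described above: reads are vertices, edges
# are weighted by overlap length, and the shortest superstring corresponds to
# a maximum-weight path visiting every vertex exactly once.

from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def shortest_superstring(reads):
    best_path, best_weight = None, -1
    for path in permutations(reads):                     # all n! orderings
        weight = sum(overlap(a, b) for a, b in zip(path, path[1:]))
        if weight > best_weight:                         # maximize total overlap
            best_path, best_weight = path, weight
    # merge consecutive reads along the best path into the superstring
    superstring = best_path[0]
    for read in best_path[1:]:
        superstring += read[overlap(superstring, read):]
    return superstring

R = ["ACGTA", "ATGAT", "ATTAG", "TAGCG", "GTATG", "GATTA"]
print(shortest_superstring(R))   # ACGTATGATTAGCG, the 14-base superstring of Fig. 4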


The increasing enthusiasm for data-intensive science, particularly in the geo- and life sciences, has fueled a renaissance in data structure research, particularly in the context of big data and concurrent computing. Nevertheless, variants of trees and graphs remain the most used data structures in bioinformatics and the ones that biologists are most likely to encounter in the literature. As such, basic familiarity with these data structures is required, both for understanding how bioinformatics algorithms operate and for efficiently using the growing plethora of publicly available bioinformatics programs. Before our algorithms can be executed and their data stored in the selected data structures, they first need to be implemented in a programming or scripting language. In the next section, we explore some important concepts of programming languages, emphasizing the principal distinctions between lower-level languages such as C++ and high-level languages like Python, Perl and R.

5 Implementing an Algorithm: Compiled Versus Interpreted Languages

From an operating system's perspective, a program is a sequence of instructions encoded in binary format (ones and zeros), known as machine language. These instructions are operating system- and CPU-specific and are rarely, if ever, manually generated or edited. Instead, programmers rely on higher-level programming languages, which use mnemonics to encode instructions with a precise, unambiguous grammar. Programming languages vary in syntax and semantics, and generally fall into two major categories: compiled or interpreted. Programs written in a compiled language are first translated, or compiled, and stored as machine language instructions in files known as executables. These programs are subsequently run on demand by the user without requiring any additional processing. In contrast, interpreted languages, also referred to as scripting languages, couple the translation and execution steps into a single process termed interpretation; i.e., high-level instructions are translated and executed each time the program is run. Compiled and interpreted languages are largely considered complementary, because they are best suited for different tasks; consequently, computing platforms have provided both from as early as the 1960s (Ousterhout, 1998). From a programmer's perspective, the principal difference between the two is best described by the perennial trade-off between ease of use and performance. With compiled languages, programmers are required to handle system-specific details, such as memory management or disk access. In contrast, scripting languages are less concerned with efficiency and typically do not allow programmers fine control over the hardware. That is to say that compiled languages deliver performance at the cost of additional complexity and a steeper learning curve (Loui, 2008). Interpreted languages, on the other hand, have a much gentler learning curve, making them particularly popular with life scientists. Additionally, several recent trends, such as increasingly faster computers and multithreading, have contributed greatly to the adoption of interpreted languages as robust alternatives to compiled languages. The principal beneficiaries have been, without a doubt, Perl and Python, which have managed to grab an important share of the bioinformatics applications market. This, we conjecture, has been helped
by their overwhelmingly rich sets of functionality, which greatly facilitate the implementation of complex applications, ranging from pathway modeling to visualization and population genetics (Cock et al., 2009; Stajich et al., 2002). The statistical computing language R (Ihaka & Gentleman, 1996) has also gained widespread popularity in a wide variety of fields, ranging from economics to actuarial and social studies. Like Python, R is an interpreted language with a full-fledged set of programming instructions. However, unlike Python, R is most commonly used for statistical analysis, because it provides a rich set of core functionality, as well as a broad range of community-contributed libraries for such diverse tasks as exploratory data analyses, food web modeling or differential gene expression in transcriptomics studies. The popularity of scripting languages is both a blessing and a curse for bioinformatics: on the one hand, the entry barrier for writing and modifying code is at its lowest; on the other hand, scripting has become the reflexive response for implementing new tools, and often an inappropriate one for performance-hungry applications. For example, it is easier for a casual programmer to script a brute force algorithm to search for the inexact occurrence of a motif in a set of fasta sequences, even though a compiled program that leverages an appropriate data structure (such as a suffix tree) would be many orders of magnitude faster.
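As a toy illustration of this performance gap, the following snippet compares a pure-Python character-by-character scan with Python's built-in str.find, which is implemented in C. Absolute timings are machine-dependent; only the relative difference is of interest here.

# Interpreted vs. compiled in miniature: a pure-Python scan versus the
# C-implemented str.find. Only the relative gap matters; timings will vary.

import random
import timeit

random.seed(42)
text = "".join(random.choice("ACGT") for _ in range(1_000_000))
pattern = "ACGTACGTAC"

def python_find(text, pattern):
    m = len(pattern)
    for i in range(len(text) - m + 1):
        if text[i:i + m] == pattern:
            return i
    return -1

t_python = timeit.timeit(lambda: python_find(text, pattern), number=3)
t_builtin = timeit.timeit(lambda: text.find(pattern), number=3)
print(f"pure Python: {t_python:.3f} s   built-in (C): {t_builtin:.3f} s")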

6 High Performance Computing

Bioinformatics abounds with resource-intensive applications; some examples are de novo transcriptome and genome assemblies, protein folding and large phylogenies. These applications have different performance bottlenecks, for example CPU, memory, disk input/output, or a combination of these. Here we discuss the use of high performance computing to handle resource-hungry applications, and review some of the most important concepts and services, such as multi-core processors, cluster computing and cloud computing. The ongoing developments in molecular biology techniques have greatly emphasized the need for large computational infrastructures that can scale to handle the ever-growing datasets produced in these fields. This shortage of computational power has spurred important advances that push existing limitations and revamp how large datasets are handled using parallel computing. Parallel solutions to computationally intensive problems can be decomposed using two high-level paradigms: data versus task parallelism. In data parallelism, the input is partitioned and each subset is processed independently. Subsets can either reside on the same machine or be distributed across physically separate ones. A popular example of data parallelism is BLAST searching. The most simplistic parallel application of BLAST, also referred to as embarrassingly parallel, consists of splitting the input, or query dataset, into n datasets which are then each searched independently on m compute nodes, where m ≤ n. By contrast, in task parallelism, the problem is split across different tasks, which work collaboratively to resolve it. As an analogy, consider the process of parallelizing the assembly of a jigsaw puzzle across n individuals. One can pseudo-randomly
split the puzzle pieces across the individuals using, for example, the part of the landscape these puzzle pieces belong to (e.g., sky or green pasture). Given that the partitioning is not guaranteed to be error-free (i.e., what we think is a blue piece from the sky may in fact be from the pond that is part of the pasture), individuals will most likely need to communicate and perhaps exchange pieces amongst themselves in order to complete their assigned tasks. Consequently, splitting a task-parallel problem across n individuals will rarely lead to an n-fold speed-up, because the inter-processor communication time will account for a non-negligible amount of the total run time. Last but not least, some problems are inherently serial and can therefore be tedious or impossible to parallelize. Traditionally, computers were equipped with a single processing unit. To give the illusion that multiple programs are running in parallel, a processing unit handles tasks pseudo-simultaneously by switching very rapidly between them in a procedure known as time-sharing. In contemporary computer architectures, CPUs can contain multiple processing units, a technology referred to as multi-core CPUs. Each core uses the legacy time-sharing model, but the availability of multiple cores allows more tasks to be handled concurrently. Since its advent, multi-core technology has packed increasingly more cores per chip; as of this writing, Intel's (http://www.intel.com) newest Xeon Phi co-processor board contains 57 cores, allowing for a previously unimaginable level of parallelism within a single machine. Despite being the most affordable level of parallelism, multi-core technology requires that bioinformatics applications be specifically programmed to take advantage of it, using a programming paradigm called multithreading. While software libraries exist in major programming languages to facilitate parallel programming, implementing even off-the-shelf functionality can be complex, even for seasoned programmers. Furthermore, due to important differences in how parallelism is implemented across programming languages, parallel versions of the same program in different languages can yield widely different speed-ups. Parallel applications whose resource needs surpass those available on a single machine are usually run on clusters. Compute clusters are collections of connected, low-budget computers that work together as a single system to achieve high performance and availability at a reasonable cost. In an academic context, clusters vary greatly in size: individual research labs or departments often run small personalized instances, but larger clusters, often comprising hundreds if not thousands of nodes, are frequently made available as a campus resource. Clusters, in spite of being equally able to handle data-parallel as well as task-parallel applications, rarely offer the best return on investment for users with sporadic high performance needs. Indeed, in addition to the initial financial investment and the ongoing maintenance cost, clusters require an upfront commitment to a specific technology and lack flexibility, because they cannot easily adapt to tasks that require more resources than are available. To circumvent these shortcomings, researchers are gradually embracing cloud services as a viable alternative.
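A minimal sketch of data parallelism using only the Python standard library is shown below: the input sequences are split into chunks and each chunk is processed on a separate core. The per-sequence task (GC content) is a trivial stand-in for a heavier job, such as a BLAST search of each chunk; the sequences themselves are randomly generated for illustration.

# Data parallelism in miniature: split the input into chunks and process each
# chunk on its own core with multiprocessing.

import random
from multiprocessing import Pool

def gc_content(seq):
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)

def process_chunk(chunk):
    return [gc_content(seq) for seq in chunk]

if __name__ == "__main__":
    random.seed(1)
    sequences = ["".join(random.choice("ACGT") for _ in range(120))
                 for _ in range(100_000)]
    n_workers = 4
    # partition the data into one chunk per worker (data parallelism)
    chunks = [sequences[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        results = pool.map(process_chunk, chunks)   # chunks run concurrently
    gc_values = [gc for chunk in results for gc in chunk]
    print(len(gc_values), "sequences processed")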


Cloud computing refers to using a remotely hosted network of managed computing infrastructure to run computations or to store data (Fox et al., 2009). Cloud computing requires no upfront investment, is commonly billed by consumption, and is particularly attractive for low-memory, data-parallel jobs. An increasing number of bioinformatics applications and pipelines are designed to take advantage of cloud infrastructures. Among these are Crossbow (Langmead et al., 2009; Gurtowski et al., 2012) for alignment and SNP calling on large-scale NGS data, Myrna (Langmead et al., 2010) for differential gene expression studies and CloVR (Angiuoli et al., 2011) for microbial sequencing analysis. Most large commercial cloud service providers, such as Amazon's EC2 (Amazon, 2010), Google Cloud Platform (Inc, 2013) and Rackspace's Public Cloud (Rackspace, 2014), offer tools to create scalable and highly available IT infrastructures to store, analyze and share terabytes (often petabytes) of data using their cloud services. Some of these providers even host images of popular biological datasets, for example the annotated human genome and microbiome, the 1000 Genomes Project (Via et al., 2010) and the modENCODE database (Birney et al., 2007). In addition to the growing number of private cloud service providers, various public high-performance computational resources can be leveraged for the analysis of biological data. For example, NSF-funded researchers can apply for a startup allocation on XSEDE (Wilkins-Diehr et al., 2008), a virtual system of computational resources in NSF-funded centers. Similarly, the Science Clouds project offers resources to the scientific community using an infrastructure paradigm similar to Amazon's EC2 (Keahey et al., 2008). There are also a few smaller initiatives giving access to resources that can be used for a limited set of functions. One such example is the Data Intensive Academic Grid (DIAG, 2013), which offers free access to its infrastructure for running the popular RNA-Seq assembler Trinity (Grabherr et al., 2011). Furthermore, an increasing number of cyberinfrastructure cores within academic institutions across the nation are currently acquiring the computational infrastructure required to meet the ever-growing computational needs of researchers in the field. The most challenging aspect of data-intensive science remains knowledge extraction, particularly in cases dealing with heterogeneous datasets, where a much broader spectrum of expertise is required. In such cases, algorithm design or high performance computing cannot be readily applied, and exploratory, hypothesis-generating approaches are more appropriate (Hochachka et al., 2007). In what follows, we discuss machine learning, a popular discipline that offers a plethora of systematic techniques for generating useful hypotheses and for transforming massive amounts of data into actionable knowledge.

7 Machine Learning

Machine learning is a research subfield of computer science concerned with programming computers to extract knowledge from data. For example, one can devise a machine learning model to classify RNA sequences as either coding
or non-coding (Menor et al., 2013); classify normal and cancerous tissues using transcription profiles (Furey et al., 2000); or even predict the target genes of transcription factors (Bauer et al., 2011). Machine learning is commonly divided into two learning schemes: supervised and unsupervised (Bishop & Nasrabadi, 2006). In supervised machine learning, models are trained, or taught, using examples for which the outcomes are known. These models are subsequently used to predict the unknown outcomes of future instances. For the RNA classification example, the program, or model, is presented with unambiguously coding or non-coding sequences and is subsequently asked, allowing for some margin of error, to predict the class of unseen sequences. When the outcome is real-valued, the supervised learning problem is known as regression, whereas for categorical or nominal outcomes the application is known as classification. In unsupervised learning, the goal is to find interesting patterns in the data; this is a less trivial problem, since we do not know a priori which patterns to look for, nor whether they can be detected (Murphy, 2012). A canonical unsupervised learning example is the clustering of data into groups. This problem consists of partitioning a set of elements into clusters such that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized (Larrañaga et al., 2006) (we look at the concept of similarity next). In this case, neither the number of clusters nor the cluster memberships are known a priori. When considering whether an application is suitable for machine learning, a practitioner considers at least the following three important criteria (Domingos, 2012): i) the model type, ii) an objective function to score the model's ability to generalize to new data, and iii) the algorithm used to find the optimal configuration parameters of the model. This process is, to some extent, very similar to the problem statisticians face when trying to find a mathematical function that best represents a statistical model (Ricci, 2005). We introduce and describe each of these three criteria below.
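As a minimal illustration of unsupervised learning, the following sketch clusters made-up two-dimensional samples with k-means; it assumes the scikit-learn library is installed, and the data and the choice of three clusters are purely illustrative.

# Unsupervised learning in miniature: k-means clustering of synthetic points.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three artificial groups of points, supplied without any labels
data = np.vstack([rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
                  rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
                  rng.normal(loc=(0, 5), scale=0.5, size=(50, 2))])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print(kmeans.cluster_centers_)      # the three inferred group centres
print(kmeans.labels_[:10])          # cluster membership of the first 10 samples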

7.1 The Machine Learning Algorithm

This point is concerned with selecting, among all possible machine-learning algorithms, the one that is most appropriate for the learning task at hand. A recurring classification model in bioinformatics is the decision tree (Kingsford & Salzberg, 2008; Quinlan, 1986). This is a tree data structure whose internal nodes are predictors and whose leaf nodes are class labels. Figure 10 depicts a simple decision tree for predicting the presence of the Convict Tang (Acanthurus triostegus) at a site using a set of habitat predictors: distance to shore (m), number of coral species (0-5), maximum depth (m) and average temperature (°C). The decision tree is built from observed data and is used to predict a response variable given values for the predictors. This is achieved by threading the input along the tree branches, i.e., answering the question at each internal node using the input values. The input is then assigned the label of the leaf where the threading ends. Thus, according to the decision tree in Figure 10, a geographical region where there are more
than 4 coral species (coral diversity > 4) and an average temperature of 10 °C is 95% likely to host populations of the Convict Tang. On the other hand, a geographical region with a coral diversity of 0, located 14,000 meters from shore and having a depth of 3,000 meters, is only 3% likely to host the Convict Tang. Decision trees are particularly efficient for tackling binary or multi-class classification problems on instances with a small number of features. The literature abounds with machine learning algorithms, and each of these can often be applied in different ways. This variability makes finding the best model for a task particularly non-trivial, and we direct interested readers to (Flach, 2012) and (Olden et al., 2008) for good introductory texts on machine learning.
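A sketch of a decision-tree classifier in the spirit of Figure 10 is shown below; it assumes scikit-learn is installed, and the habitat records and presence/absence labels are invented for illustration (they are not the data behind Figure 10).

# A toy decision tree for presence/absence prediction from habitat predictors.

from sklearn.tree import DecisionTreeClassifier

# columns: distance to shore (m), coral species (0-5), max depth (m), temp (C)
X = [[200, 5, 10, 26], [150, 4, 8, 27], [300, 5, 12, 25],
     [14000, 0, 3000, 10], [9000, 1, 800, 12], [500, 3, 20, 24],
     [12000, 0, 2500, 11], [250, 4, 15, 26]]
y = [1, 1, 1, 0, 0, 1, 0, 1]          # 1 = Convict Tang present, 0 = absent

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(tree.predict([[400, 5, 18, 25]]))   # predicted presence for a new site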

7.2 Objective Function

The objective, or evaluation, function provides a measure of the model's accuracy. In regression, for instance, a simple evaluation function consists of computing the squared errors of the predicted values on test data. For classification, a simple evaluation function can consist of counting the correct versus erroneous predictions. Naturally, one is interested in finding the machine learning approach that minimizes the squared errors or maximizes the number of accurate predictions. To this aim, the evaluation function is used to find the classifier yielding the best outcome with respect to the desired metric. When evaluating the predictive performance of classifiers, the assessment should occur on previously unseen data, so that over-fitting is not rewarded. Over-fitting is a poor modeling outcome in which the machine learning algorithm recalls previously seen data but fails to generalize to new instances. To avoid over-fitting, it is common to reserve a small subset of the initial data exclusively for testing (commonly 20% of the initial dataset), or to use a resampling technique if the data are sparse.
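Continuing the sketch above, the following snippet holds out 20% of the (invented) data for testing and reports accuracy only on that unseen subset; scikit-learn is again assumed.

# Held-out evaluation: fit on the training split, score on the unseen split.

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X = [[200, 5, 10, 26], [150, 4, 8, 27], [300, 5, 12, 25],
     [14000, 0, 3000, 10], [9000, 1, 800, 12], [500, 3, 20, 24],
     [12000, 0, 2500, 11], [250, 4, 15, 26], [700, 2, 30, 22],
     [11000, 1, 1500, 13]]
y = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))   # accuracy on unseen data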

7.3 Optimization Algorithm

Once the machine-learning algorithm is selected, it is parameterized, or tweaked, for optimal performance. An analogy is a sound engineer selecting the appropriate values for all the controls on a mixing board. While the engineer's prior knowledge of how each knob affects the overall quality of the sound is helpful, intuition plays a similarly important, if not more important, role in optimizing the sound quality. Conceptually, optimization can be thought of as exploring the space of parameter values in search of those that optimize the objective function. When the space of such options is quasi-infinite, as is the case with many complex bioinformatics machine-learning problems, it is necessary to apply a strategic approach to render the search tractable. To use the mixing-board analogy, if only 5 knobs are available, each of which can have 10 settings, then there are 10^5, or 100,000, possible options. By contrast, parameters in machine learning can be real-valued, which makes the option space infinitely large. It therefore becomes critical to use heuristics that will produce a near-optimal solution by trying only those parameters that are most likely to yield good results.

Fig. 10: An input is assigned to a leaf of the tree by answering the questions at each intermediate node. Note that each node of the tree has mutually exclusive and exhaustive outcomes. The predicted probability of Convict Tang presence is shown at the leaves.

Given that all new inferences are exclusively derived from data, the concept of garbage in, garbage out (Bininda-Emonds et al., 2004) holds particularly true for machine learning. Therefore, to attain the best generalization possible, the input data should be cleaned of any biases that can affect the learning process. Another important facet of machine learning is the selection of features, or input variables. Independent features that correlate well with the learning task are preferred over correlated features that do not map to the desired outcomes. Selecting the best features for a novel machine learning problem requires not only a great deal of effort, familiarity with the problem and good intuition, but also a great deal of trial and error. As such, data preparation and feature selection are often the most time-consuming stages. In some instances, manually selecting among a large set of possible features can be time-consuming or impossible. A popular method for overcoming this obstacle is Principal Component Analysis, which we discuss below.
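As a sketch of the systematic parameter search described above, the snippet below exhaustively evaluates a small grid of decision-tree settings with cross-validation; scikit-learn is assumed, the data are the invented habitat records used earlier, and real searches typically rely on smarter heuristics when the parameter space is large or continuous.

# Exhaustive grid search over a small set of "knob" settings.

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X = [[200, 5, 10, 26], [150, 4, 8, 27], [300, 5, 12, 25],
     [14000, 0, 3000, 10], [9000, 1, 800, 12], [500, 3, 20, 24],
     [12000, 0, 2500, 11], [250, 4, 15, 26], [700, 2, 30, 22],
     [11000, 1, 1500, 13]]
y = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]

param_grid = {"max_depth": [1, 2, 3], "min_samples_leaf": [1, 2, 3]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=2).fit(X, y)
print(search.best_params_)   # the settings with the best cross-validated score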

7.4 Principal Components Analysis

Principal Components Analysis (PCA) has been successfully used in a broad range of ecological applications, such as understanding wildlife-habitat relationships (Morrison et al., 2012), ecological stability (Donohue et al., 2013) or identifying important patterns in community data (Dray et al., 2012). In brief, PCA is used to represent the data using a smaller number of dimensions while minimizing the loss of information resulting from such dimensionality reduction.


Fig. 11: Scatter plot of 100 samples and three variables: temperature, distance to ocean and habitat density.

Take for instance the data points representing temperature (x-axis), distance to ocean (y-axis) and habitat density (z-axis) in Figure 11. To find the principal components, one must identify the axes, or components, that are perpendicular to one another and that best explain the variance in the dataset. Without delving into the mathematical detail, the components can be thought of as the set of perpendicular lines that most closely approach all the data points. The solid red line in the scatter plots (Figure 12) represents the line along which the data show the most variance. The dashed line captures the second largest variability and runs in a direction perpendicular to the red line. In PCA, these lines are identified in the n-dimensional space, where n is the number of variables in the data, rather than in the 2-d space of the example. Such lines are described using eigenvectors and their corresponding eigenvalues. A formal definition is beyond the scope of this text, but an eigenvector can be thought of as the direction of a line across the data, and its eigenvalue as the scale of the eigenvector, or the extent of the variance it captures. The number of eigenvectors is equal to the number of variables in a dataset, and each eigenvector captures a proportion of the variance proportional to its eigenvalue. In the example above, the three eigenvalues are 24.32, 7.67 and 1.54. Thus, the first two eigenvectors capture (24.32 + 7.67)/(24.32 + 7.67 + 1.54), or approximately 96%, of the variance. To reduce the dimensionality of a dataset, one simply projects the points onto the new axes. Similarly, an approximation of the original dataset can be recovered by projecting back onto the original space. While PCA is guaranteed to find the components that yield the best reduction in dimensionality, popular approaches for computing PCA, such as those implemented in common statistical or mathematical packages, are computationally expensive (Abraham & Inouye, 2014).
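For readers who want to experiment, the following sketch performs PCA by eigen-decomposition of the covariance matrix on synthetic data with three variables, mirroring the temperature, distance-to-ocean and habitat-density example; the numbers are made up and will not reproduce the eigenvalues quoted above.

# PCA via eigen-decomposition of the covariance matrix, using numpy.

import numpy as np

rng = np.random.default_rng(0)
# 100 correlated samples x 3 variables (synthetic stand-ins for the figure data)
latent = rng.normal(size=(100, 1))
data = np.hstack([10 + 2 * latent + rng.normal(scale=0.3, size=(100, 1)),
                  8 - 1.5 * latent + rng.normal(scale=0.3, size=(100, 1)),
                  14 + 3 * latent + rng.normal(scale=0.3, size=(100, 1))])

centered = data - data.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]            # largest eigenvalue first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum()
print(explained)                                 # proportion of variance per component
# project onto the first two principal components (dimensionality reduction)
reduced = centered @ eigenvectors[:, :2]
print(reduced.shape)                             # (100, 2)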

This article is protected by copyright. All rights reserved.

11

20

10

Habitat Density

Distance to Ocean

22

9

8

18

16

14

7 12

6 10

5 6

7

8

9

10

11

12

13

4

14

6

8

10

12

14

Distance to Ocean

Temperature 3

14

Eigenvector 2

2

Temperature

Accepted Article

12

12

10

8

1

0

−1

−2

6

8

10

12

14

16

18

Habitat Density

20

22

−3

−6

−4

−2

0

2

4

6

Eigenvector 1

Fig. 12: Pairwise plots between the variables distance to ocean, temperature and habitat density. The solid red line represents the line most closely approaching all the data points. The dashed blue line captures the second largest variability in a direction perpendicular to the red line. The bottom right graph shows the data in the new coordinate system composed of eigenvector 1 and eigenvector 2 tionally expensive (Abraham & Inouye, 2014). For instance, the Singular Value Decomposition – a linear algebra technique to factorize a matrix into three matrices that contain the information relevant to deriving the principal components (Kalman, 1996) – grows as the minimum between M 2 × N and M × N 2 , where M is the number of sample points and N is the number of variables (Abraham & Inouye, 2014). Hence, carrying out a PCA on data from a RAD-seq experiment containing millions of SNPs and thousands of individuals would be a time consuming, if not a completely impractical task unless randomized heuristics are used (Dixon, 1983; Abraham & Inouye, 2014). Another compelling approach for reducing the dimensionality of a dataset is by using random projections. We discuss this method in Section 8
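To make the eigen-decomposition described above concrete, the following minimal sketch computes a PCA "by hand" with numpy on a synthetic stand-in for the 100-sample, three-variable dataset of Figure 11. The simulated values, and therefore the resulting eigenvalues, are invented and will not match the 24.32, 7.67 and 1.54 quoted in the text.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the 100-sample, 3-variable dataset of Figure 11
# (temperature, distance to ocean, habitat density); values are made up.
n_samples = 100
temperature = rng.normal(10, 2, n_samples)
distance_to_ocean = 0.8 * temperature + rng.normal(0, 1, n_samples)
habitat_density = 1.5 * temperature + rng.normal(0, 1.5, n_samples)
X = np.column_stack([temperature, distance_to_ocean, habitat_density])

# Centre the data, then diagonalize the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # returned in ascending order

# Sort from largest to smallest eigenvalue (most to least variance).
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum()
print("proportion of variance per component:", np.round(explained, 3))
print("first two components capture:", round(explained[:2].sum() * 100, 1), "%")

# Dimensionality reduction: project the centred data onto the first two
# eigenvectors (the PCA 'scores' used in ordination plots).
scores = Xc @ eigenvectors[:, :2]
print("reduced data shape:", scores.shape)   # (100, 2)
```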


8 Applying Computational Thinking

8.1 Approximating Sequence Similarity Using Random Projections and Spatial Proximity

DNA sequence clustering, which consists of grouping DNA sequences that share high similarity, is an important task in bioinformatics. For example, clustering plays a particularly important role in the analysis of RAD-seq data, where the first step consists of grouping the sequences arising from the same locus. Various tools exist to cluster sequences; one such program is CD-HIT (Fu et al., 2012), which uses an incremental clustering algorithm. Briefly, sequences are first sorted in order of decreasing length and the longest one becomes the representative of the first cluster. Sequences are then compared iteratively to the cluster representatives and are added to a cluster if their level of similarity with its representative is above a given threshold, or become the representative of a new cluster if they fall below that threshold (a minimal sketch of this strategy appears below). One can see that the complexity of this algorithm is a function of the number of clusters: the higher the number of clusters, the more comparisons are required to cluster a sequence. Most clustering programs use variations on this approach, which, despite clever heuristics, remains CPU intensive and requires large amounts of memory to keep track of all the cluster representatives as well as the input sequences. For example, using CD-HIT (v4.6) to cluster 25 million sequences of length 120 bases, a small dataset by today's standards, requires 37 GB of RAM, thereby limiting its use to large servers.

To bypass the large memory requirements, one can imagine a solution that splits the clustering task into multiple sub-tasks, each with smaller memory requirements. For example, one can pre-filter the data by partitioning it into subsets such that a sequence in one subset cannot participate in a cluster from a different subset. Each subset can then be clustered independently using existing methods. Such a solution, in addition to allowing perfect parallelization of the problem, is far less memory-intensive because the number of cluster representatives in each subset would be smaller than in the initial dataset. Unfortunately, this solution also requires computing all the pairwise alignments to produce an optimal set of non-overlapping partitions. Therefore, a fast heuristic for solving the problem is still needed.

The sequence-partitioning stage can be reduced to a fundamental data-mining problem: that of finding nearest neighbors. An important instance of the nearest neighbor problem is collaborative filtering (Sarwar et al., 2001), the theory behind most recommender systems, including those used by Netflix (http://www.netflix.com) and Amazon (http://www.amazon.com). Briefly, collaborative filtering consists of using similarity between users to estimate their taste with regards to items such as movies, music, books, etc. To do so, individuals are typically represented using a list of variables, commonly called a feature vector, which may include age, city, sex, education level, preferred movie genre, most liked items, etc. The similarity between two individuals is then computed using measures such as the Euclidean or the Jaccard distances (Deza & Deza, 2009); distance can easily be transformed into similarity.
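The sketch below illustrates the greedy, incremental strategy described above for CD-HIT. It is a simplification for illustration only: real implementations rely on k-mer filters and optimized alignment rather than the naive position-by-position similarity used here, and the example reads and the 0.85 threshold are made up.

```python
# A minimal sketch of greedy incremental clustering in the spirit of CD-HIT.

def similarity(a: str, b: str) -> float:
    """Fraction of matching positions over the length of the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n

def greedy_cluster(sequences, threshold=0.9):
    """Assign each sequence to the first representative it matches closely
    enough; otherwise promote it to the representative of a new cluster."""
    representatives = []          # one representative per cluster
    clusters = []                 # list of lists of member sequences
    for seq in sorted(sequences, key=len, reverse=True):
        for i, rep in enumerate(representatives):
            if similarity(seq, rep) >= threshold:
                clusters[i].append(seq)
                break
        else:                     # no representative was similar enough
            representatives.append(seq)
            clusters.append([seq])
    return clusters

reads = ["ACGTGACCT", "ACGTGACCA", "TTTTGACCA", "ACGTGACCT", "TTTTGACCC"]
for i, members in enumerate(greedy_cluster(reads, threshold=0.85)):
    print(f"cluster {i}: {members}")
```

Because every new sequence may be compared against every existing representative, the cost grows with the number of clusters, which is exactly the bottleneck discussed above.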


In bioinformatics, sequences are often transformed into feature vectors of their k-mers (DNA words of length k). For example, the sequence ACGTGAC can be represented using the vector of 2-mers [AC=2, CG=1, GT=1, TG=1, GA=1, XX=0], where XX refers to any other, non-listed k-mer. Using k-mer representations, it is easy to approximate the level of similarity between two sequences by computing the Euclidean distance between their k-mer vectors. For the purpose of this example, it is sufficient to note that a sequence is more likely to be similar to one with which it shares a large number of k-mers than to one with which it shares no k-mers. Before we can use this approximation in the sequence-partitioning problem, there are two issues to solve: i) the memory required to store all k-mer counts can be substantial, particularly for large values of k, and ii) we would still need to compute, and hold in memory, an enormous number of pairwise Euclidean distances before partitioning the initial dataset. We discuss solutions to these problems in what follows.
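A minimal sketch of this k-mer representation, assuming 2-mers and using only the Python standard library; the example sequences are invented.

```python
from collections import Counter
from math import sqrt
from itertools import product

def kmer_vector(seq: str, k: int = 2) -> Counter:
    """Count the k-mers (DNA words of length k) occurring in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def euclidean_distance(v1: Counter, v2: Counter, k: int = 2) -> float:
    """Euclidean distance between two k-mer count vectors over the full
    4**k dimensional k-mer space (absent k-mers count as zero)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return sqrt(sum((v1[m] - v2[m]) ** 2 for m in kmers))

a = kmer_vector("ACGTGAC")          # AC=2, CG=1, GT=1, TG=1, GA=1
b = kmer_vector("ACGTGAT")          # differs only in its final 2-mer
c = kmer_vector("TTTTTTT")          # shares no 2-mers with the others
print(euclidean_distance(a, b))     # small distance: many shared k-mers
print(euclidean_distance(a, c))     # larger distance: no shared k-mers
```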

8.2 Dimensionality Reduction Using Random Projections

Given that the number of possible k-mers grows exponentially with k, storing the k-mer counts of a single sequence would require approximately 67 megabytes when k=13 (using 1 byte for each of the 4^13 ≈ 67,108,864 possible k-mers). This cannot scale to the billions of sequences produced by modern sequencing instruments. There are various methods for reducing the dimensionality of a dataset; one popular method is PCA. Unfortunately, PCA is too expensive to compute for very high-dimensional data. Random projection is another dimensionality reduction method that has shown great success in Big Data applications (Bingham & Mannila, 2001). It consists of projecting feature vectors onto a space of lower dimensionality. For example, by projecting the point A in Figure 13.a, whose coordinates are [x=3, y=2, z=4], onto the randomly selected lines L1 and L2, we obtain points with coordinates 7 and 12, respectively. Thus, by projecting a point in a 3-dimensional feature space onto two randomly selected lines, we reduced the number of features from 3 to 2. One can achieve a considerable reduction in dimensionality by projecting sequences in the 4^k-dimensional k-mer space onto a much smaller number of randomly chosen lines. Furthermore, according to the Johnson-Lindenstrauss lemma (Johnson & Lindenstrauss, 1984), this dimensionality reduction can be done in such a way that distances between points are nearly preserved (Figure 13.b). In other words, the distances between sequences are still approximately preserved (up to a minor distortion) after reducing from a 13-mer space of 67,108,864 dimensions to a few hundred dimensions (the details of this lemma are omitted for simplicity; more information can be found in Ailon & Chazelle (2009)).
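The following sketch illustrates the idea with numpy, using a Gaussian random projection matrix. The parameters (k=4, 32 random directions) and the simulated k-mer vectors are illustrative choices, not values from the text; with k=13 the original space would have 4^13 dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: we use k=4 (256 dimensions) purely to keep the illustration
# small; the principle is identical for the 4**13-dimensional 13-mer space.
k = 4
d_original = 4 ** k       # 256 dimensions
d_reduced = 32            # number of random lines to project onto

# Random Gaussian projection matrix (one column per random direction).
R = rng.normal(0, 1 / np.sqrt(d_reduced), size=(d_original, d_reduced))

# Three hypothetical k-mer count vectors; in practice these would come from
# counting k-mers in real reads.
x = rng.poisson(1.0, d_original).astype(float)
y = x + rng.normal(0, 0.2, d_original)          # a near-duplicate of x
z = rng.poisson(1.0, d_original).astype(float)  # an unrelated sequence

def dist(a, b):
    return float(np.linalg.norm(a - b))

# Distances before and after projection are approximately preserved
# (Johnson-Lindenstrauss), while each vector shrinks from 256 to 32 numbers.
print("original:  d(x,y)=%.2f  d(x,z)=%.2f" % (dist(x, y), dist(x, z)))
print("projected: d(x,y)=%.2f  d(x,z)=%.2f" %
      (dist(x @ R, y @ R), dist(x @ R, z @ R)))
```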


8.3 Dataset Partitioning Through Binning

A trivial approach to partitioning sequences is through binning (see Figure 13). The more similar two points are, the more likely they are to co-occur in the same grid cell. Obviously, whether two candidate sequences land in the same cell depends on the grid layout. For example, in the grid layout of Figure 13.d, the points A and B are placed in different grid cells despite falling within the same cell in other grid configurations. However, by repeating this process multiple times, one can reduce the number of false positives by retaining only points that co-occur in the same cell at least a certain number of times. In fact, it is possible to define the appropriate grid size, as well as the number of times the experiment needs to be repeated, in order to guarantee that highly similar sequences will co-occur in the same partition at least once (Feldman et al., 2013). A simple heuristic for the sequence-partitioning problem, and consequently for the loci-inference problem in RAD-seq, consists of assigning to the same partition all sequences that co-occurred in the same grid cell at least once. Each of the resulting partitions can then be clustered independently of the others to infer loci.
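A minimal sketch of the binning step, assuming the sequences have already been reduced to low-dimensional points (here, invented 2-d points). For simplicity it uses randomly shifted grids rather than the random rotations shown in Figure 13; both serve the same purpose of varying the grid layout across repetitions.

```python
import numpy as np
from collections import defaultdict
from itertools import combinations

rng = np.random.default_rng(1)

def bin_points(points, cell_size=1.0, seed=0):
    """Assign each point to a grid cell after a random shift of the grid.
    Points sharing a cell key are candidate neighbours for this grid."""
    r = np.random.default_rng(seed)
    offset = r.uniform(0, cell_size, size=points.shape[1])
    keys = np.floor((points + offset) / cell_size).astype(int)
    cells = defaultdict(list)
    for idx, key in enumerate(map(tuple, keys)):
        cells[key].append(idx)
    return cells

# Toy 2-d data standing in for randomly projected k-mer vectors: two tight
# groups of similar sequences plus one unrelated point.
points = np.vstack([
    rng.normal([0.0, 0.0], 0.05, size=(3, 2)),   # group 1
    rng.normal([3.0, 3.0], 0.05, size=(3, 2)),   # group 2
    [[1.5, 1.5]],                                # unrelated point
])

# Repeat the binning with several independently shifted grids and keep the
# pairs that co-occur in at least `min_hits` of them.
co_occurrences = defaultdict(int)
n_grids, min_hits = 5, 3
for g in range(n_grids):
    for members in bin_points(points, cell_size=1.0, seed=g).values():
        for i, j in combinations(sorted(members), 2):
            co_occurrences[(i, j)] += 1

candidate_pairs = [p for p, hits in co_occurrences.items() if hits >= min_hits]
print(candidate_pairs)   # pairs within each tight group are typically recovered
```

The candidate pairs define the partitions; each partition can then be handed to a conventional clustering tool independently of the others.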

8.4 Synthesis

This example illustrates how, using computational thinking and a few clever simplifications, a complex, memory-intensive task can be turned into one that parallelizes easily and has far lower memory requirements. While this heuristic might not always yield an optimal clustering, choosing an acceptable false-positive rate can produce satisfactory results for many biological applications, such as RAD-seq analysis or database searches.

Fig. 13: Dimensionality reduction and sequence clustering by random projections and binning. (a) The points A, B and C are represented in a 3-dimensional coordinate system; their projections onto the lines L1 and L2 are the points A1, B1 and C1, and A2, B2 and C2, respectively. (b) The points A, B and C in the new reduced-dimension coordinate system; to highlight the level of similarity between sequences, a unit grid centered at the origin is superimposed on the coordinate system. (c) After randomly rotating the grid, the points A and B still co-occur in, or bin to, the same grid cell; (d) however, some random rotations can result in highly similar points binning to different cells in the grid.

9 Conclusion

Bioinformatic applications and algorithms are constantly being developed to help assimilate and mine the colossal datasets produced by modern high-throughput biology. These computational assets are key to understanding natural phenomena and provide the underpinning necessary for integrating subfields of biology into systems-level research. Unfortunately, and despite the undeniable benefits of its applications in biology, computer science is still viewed by many biologists as a complex, unattainable field. In an effort to correct this misconception, we have introduced some simple yet fundamental concepts of computer science needed to build an intellectual structure for understanding key contributions and applications of the field. We wrote this paper with the objective of demystifying some basic computer science concepts, debunking some common misconceptions we encounter, and arousing interest among molecular ecologists to investigate this exciting and rewarding field on their own. Being a biologist and a computer scientist ourselves, we are well aware of the benefits such collaborations bring to both fields, and we hope this review will help foster a more dynamic partnership between computer scientists and biologists.

10 Acknowledgements

This work was supported by grants from the National Institutes of Health (P20GM103516, G12MD007601) and the National Science Foundation (OCE 12-60169). This is SOEST contribution number 9230 and HIMB contribution number 1605.


Box 1: Definitions

Algorithm: A step-by-step description of the solution to a well-defined problem. An algorithm is often compared to a recipe for carrying out a task using a known number of enumerable and unambiguous steps.

Algorithm complexity analysis: The study of the efficiency of algorithms in terms of the number of operations (or the amount of space) required for a given input size. Without complexity analysis, it is not possible to say whether a solution is practical, regardless of how creative the algorithm is.

Algorithm design paradigms: A methodological framework, or blueprint, that describes the steps required to construct efficient solutions to a wide range of problems, including the mathematical framework that describes the complexity of these solutions in terms of execution time, required memory, disk space, etc.

Approximate solution: Refers to solutions that are often associated with intractable problems. Approximate solutions run in a reasonable time and usually offer theoretical guarantees on the near-optimality of their result. For example, the greedy approximate solution to the shortest common superstring problem is at most 4 times longer than the optimal solution.

Complexity: While some problems are theoretically solvable, their solutions can require an unreasonable amount of resources. Complexity is often used to describe the practicality of solutions to computational problems and how these solutions scale with larger inputs. The complexity of a problem is often described using a function that highlights how the number of operations carried out in the solution grows with the input size. However, in some cases, the complexity is summarized using the class to which the function belongs (e.g. linear, polynomial, exponential, factorial, etc.).

Data structure: A way of organizing data in a computer so that it can be read or written efficiently. Examples of useful data structures are trees, where the information is organized hierarchically, and graphs, where the information is encoded in vertices (nodes) and edges.

Decision trees: A popular classification and regression technique which creates a tree-like model that predicts the value of a target variable (stored in the leaves) by modeling the decision rules (stored in the internal nodes) that are learned from the data.

Embarrassingly parallel solution: Also known as perfect parallelization, a term for a solution in which the data are simply divided into a number of independent tasks, each of which can be executed as a separate process.

Fasta: A plain-text format representing nucleotide or peptide sequences using DNA or amino acid letters. In a Fasta file, each sequence record is represented by a line that starts with the ">" symbol, followed by the sequence id and an optional description, and one or more lines of sequence data. Traditionally, each Fasta file produced in DNA-sequencing experiments was accompanied by a Fasta-Qual file, a Fasta-formatted file containing the quality score, as a number between 0 and 60, for every DNA base of the sequence (http://en.wikipedia.org/wiki/FASTA_format).


Fastq: A plain-text format that combines nucleotide sequences and their associated quality scores. Each sequence is represented using four lines. The first line starts with the "@" character and contains the sequence id and an optional description; the second line contains the nucleotide data; the third line starts with the "+" character, optionally followed by the sequence id; and the fourth line contains quality information, where each base of the sequence is encoded using a single character. Encoding the quality using a single character allows significant space savings compared to Fasta-Qual files (http://en.wikipedia.org/wiki/FASTQ_format).

Feature vector: A collection of values describing each input in terms of variables. For example, the feature vector of a geographical site in terms of its latitude, longitude and average yearly temperature could be (21.3N, 157.8W, 28 °C).

Heuristic: A strategy that uses clever simplifications to drastically reduce the space of solutions that needs to be explored. While heuristics cannot guarantee optimal solutions, these strategies usually provide solutions that are good enough most of the time (Romanycia & Pelletier, 1985).

Intractable: In computational theory, intractability refers to problems that can be solved in theory, but for which an exact solution requires a prohibitive amount of resources, such as time, RAM or CPU. The reconstruction of phylogenetic trees using parsimony is a good example of an intractable problem: finding an optimal parsimonious phylogeny requires considering a very large space of solutions, even for a small number of species.

Multithreading (concurrent computing): Consists of partitioning a program into subparts that can run concurrently on multiple CPU cores.

Non-deterministic polynomial, or NP-complete: By far the most important, and perhaps most controversial, advance in theoretical computer science came in the 1970s from the work of Cook and Levin (Cook, 1971; Levin, 1973). Their discovery stipulated that, for a certain class of problems among those for which only non-polynomial algorithms are known, a polynomial-time exact solution could only be obtained on a non-deterministic computer: a machine with an infinite number of processors that can, almost magically, provide us with the solution. This subclass of problems is called non-deterministic polynomial, or NP-complete. As observed by McGeoch (2007), "If a problem is NP-complete, finding an algorithm for it that is always both fast and optimal, or showing that no such algorithm can exist, would be analogous to finding a cure for the common cold: fame and fortune would be yours." (p. 26).

Optimization Algorithm: A technique for finding the minimum value of a mathematical function. Typically, optimization algorithms are executed iteratively, selecting values using a predetermined scheme until an optimum or a satisfactory solution is found.

Program Control Structure: A statement that alters the execution sequence of a program by determining the order in which other statements are executed. Common control structures are if-else statements and for and while loops.

Pseudocode: A high-level natural language used to unambiguously describe an algorithm. Pseudocode is intended to be read and understood by a human and therefore needs to be translated into a programming language before it can be executed by a computer.

String: A concatenation of a finite number of characters. For example, DNA sequences are strings over the four-letter alphabet {A,C,G,T}.

Suffix trees: A data structure which stores all the suffixes of a text, i.e., the substrings that end the text. The information stored in suffix trees is organized hierarchically, in a manner similar to phylogenetic trees.

Unsupervised Learning: A type of machine learning used to draw inferences in the absence of target values, or "ground truth." A popular unsupervised learning example in ecology is clustering, which involves grouping similar items such that the distance between items of the same cluster is minimized while the distance between items from different clusters is maximized.


11 Additional Resources

1. Introduction to computational thinking
Harel, David. "Computers Ltd.: What They Really Can't Do." Oxford University Press (2000).
Lee, Kent D., and Steve Hubbard. "Data Structures and Algorithms with Python." (2015).

2. Graphs and networks
Zhou, Jizhong, et al. "Functional molecular ecological networks." MBio 1.4 (2010): e00169-10.
Mason, Oliver, and Mark Verwoerd. "Graph theory and networks in biology." Systems Biology, IET 1.2 (2007): 89-119.

3. Machine learning
Olden, Julian D., Joshua J. Lawler, and N. LeRoy Poff. "Machine learning methods without tears: a primer for ecologists." The Quarterly Review of Biology 83.2 (2008): 171-193.
Smith, Lindsay I. "A tutorial on principal components analysis." Cornell University, USA 51 (2002): 52.

4. Software design and implementation
Wilson, Greg. "Software carpentry." Computing in Science & Engineering 8 (2006): 66. http://software-carpentry.org/
Libeskind-Hadas, Ran, and Eliot Bush. "Computing for Biologists: Python Programming and Principles." (2014).

5. Linux and software pipelines
Ray, Deborah S., and Eric J. Ray. "Unix and Linux: Visual QuickStart Guide." (2009).

6. High performance computing
Eijkhout, Victor. "Introduction to High Performance Scientific Computing." (2010). https://bitbucket.org/VictorEijkhout/hpc-book-and-course
Parallel computing explained: Cyberinfrastructure Tutor at the National Center for Supercomputing Applications. https://www.citutor.org/browse.php

7. General computing concepts
Brookshear, J. Glenn. "Computer science: an overview." (2004).
Haddock, Steven Harold David, and Casey W. Dunn. "Practical Computing for Biologists." Sunderland, MA: Sinauer Associates, 2011. http://practicalcomputing.org/


References

Abraham G, Inouye M (2014) Fast principal component analysis of large-scale genome-wide data. PloS one, 9, e93766.
Ailon N, Chazelle B (2009) The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39, 302–322.
Amazon E (2010) Amazon elastic compute cloud (Amazon EC2). http://aws.amazon.com/ec2/.
Angiuoli SV, Matalka M, Gussman A, et al. (2011) Clovr: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC bioinformatics, 12, 356.
Bastide M, McCombie WR (2007) Assembling genomic dna sequences with phrap. Current Protocols in Bioinformatics, pp. 11–4.
Bauer T, Eils R, König R (2011) Rip: the regulatory interaction predictor, a machine learning-based approach for predicting target genes of transcription factors. Bioinformatics, 27, 2239–2247.
Bingham E, Mannila H (2001) Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 245–250. ACM.
Bininda-Emonds OR, Jones KE, Price SA, Cardillo M, Grenyer R, Purvis A (2004) Garbage in, garbage out. In Phylogenetic supertrees, pp. 267–280. Springer.
Birney E, Stamatoyannopoulos JA, Dutta A, et al. (2007) Identification and analysis of functional elements in 1% of the human genome by the encode pilot project. Nature, 447, 799–816.
Bishop CM, Nasrabadi NM (2006) Pattern recognition and machine learning, vol. 1. Springer New York.
Brookshear JG, Brookshear JG (2003) Computer science: an overview, vol. 8. Addison-Wesley.
Cock PJ, Antao T, Chang JT, et al. (2009) Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics, 25, 1422–1423.
Cook SA (1971) The complexity of theorem-proving procedures. In Proceedings of the third annual ACM symposium on Theory of computing, pp. 151–158. ACM.
Deza MM, Deza E (2009) Encyclopedia of distances. Springer.


DIAG (2013) Data Intensive Academic Grid. http://diagcomputing.org/.
Dixon JD (1983) Estimating extremal eigenvalues and condition numbers of matrices. SIAM Journal on Numerical Analysis, 20, 812–814.
Domingos P (2012) A few useful things to know about machine learning. Communications of the ACM, 55, 78–87.
Donohue I, Petchey OL, Montoya JM, et al. (2013) On the dimensionality of ecological stability. Ecology letters, 16, 421–429.
Dray S, Pélissier R, Couteron P, et al. (2012) Community ecology in the age of multivariate multiscale spatial analysis. Ecological Monographs, 82, 257–275.
Eddy SR (2004) What is dynamic programming? Nature biotechnology, 22, 909–910.
Edgar RC (2004) Muscle: a multiple sequence alignment method with reduced time and space complexity. BMC bioinformatics, 5, 113.
Feldman D, Schmidt M, Sohler C (2013) Turning big data into tiny data: Constant-size coresets for k-means, pca and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1434–1453. SIAM.
Flach P (2012) Machine learning: the art and science of algorithms that make sense of data. Cambridge University Press.
Fox A, Griffith R, Joseph A, et al. (2009) Above the clouds: A berkeley view of cloud computing. Dept. Electrical Eng. and Comput. Sciences, University of California, Berkeley, Rep. UCB/EECS, 28.
Fu L, Niu B, Zhu Z, Wu S, Li W (2012) Cd-hit: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28, 3150–3152.
Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16, 906–914.
Grabherr MG, Haas BJ, Yassour M, et al. (2011) Full-length transcriptome assembly from rna-seq data without a reference genome. Nature biotechnology, 29, 644–652.
Gurtowski J, Schatz MC, Langmead B (2012) Genotyping in the cloud with crossbow. Current Protocols in Bioinformatics, pp. 15–3.
Gusfield D (1997) Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press.
Heineman GT, Pollice G, Selkow S (2009) Algorithms in a Nutshell. O'Reilly Media, Inc.


Hochachka WM, Caruana R, Fink D, et al. (2007) Data-mining discovery of pattern and process in ecological systems. The Journal of Wildlife Management, 71, 2427–2437.
Huang X, Madan A (1999) Cap3: A dna sequence assembly program. Genome research, 9, 868–877.
Ihaka R, Gentleman R (1996) R: a language for data analysis and graphics. Journal of computational and graphical statistics, 5, 299–314.
Inc G (2013) Google Cloud Platform. https://cloud.google.com/.
Johnson WB, Lindenstrauss J (1984) Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics, 26, 1.
Jones NC, Pevzner P (2004) An introduction to bioinformatics algorithms. MIT press.
Kalman D (1996) A singularly valuable decomposition: the svd of a matrix. The college mathematics journal, 27, 2–23.
Keahey K, Figueiredo R, Fortes J, Freeman T, Tsugawa M (2008) Science clouds: Early experiences in cloud computing for scientific applications. Cloud computing and applications, 2008, 825–830.
Kececioglu JD, Myers EW (1995) Combinatorial algorithms for dna sequence assembly. Algorithmica, 13, 7–51.
Kingsford C, Salzberg SL (2008) What are decision trees? Nature biotechnology, 26, 1011–1013.
Korostensky C, Gonnet GH (2000) Using traveling salesman problem algorithms for evolutionary tree construction. Bioinformatics, 16, 619–627.
Langmead B, Hansen KD, Leek JT, et al. (2010) Cloud-scale rna-sequencing differential expression analysis with myrna. Genome Biol, 11, R83.
Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL (2009) Searching for snps with cloud computing. Genome Biol, 10, R134.
Laporte G (2010) A concise guide to the traveling salesman problem. Journal of the Operational Research Society, 61, 35–40.
Larrañaga P, Calvo B, Santana R, et al. (2006) Machine learning in bioinformatics. Briefings in bioinformatics, 7, 86–112.
Leiserson CE, Rivest RL, Stein C, Cormen TH (2001) Introduction to algorithms. The MIT press.
Levin LA (1973) Universal sequential search problems. Problemy Peredachi Informatsii, 9, 115–116.


Loui RP (2008) In praise of scripting: Real programming pragmatism. Computer, 41, 22–26.
Luscombe NM, Greenbaum D, Gerstein M (2001) What is bioinformatics? a proposed definition and overview of the field. Methods of information in medicine, 40, 346–358.
Ma B (2008) Why greed works for shortest common superstring problem. In Combinatorial pattern matching, pp. 244–254. Springer.
Marx V (2013) Genomics in the clouds. Nature methods, 10, 941–945.
McGeoch CC (2007) Experimental algorithmics. Communications of the ACM, 50, 27–31.
Menor M, Baek K, Poisson G (2013) Multiclass relevance units machine: benchmark evaluation and application to small ncrna discovery. BMC genomics, 14, S6.
Moilanen A, Nieminen M (2002) Simple connectivity measures in spatial ecology. Ecology, 83, 1131–1145.
Morgenstern B (1999) Dialign 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, 15, 211–218.
Morrison ML, Marcot B, Mannan W (2012) Wildlife-habitat relationships: concepts and applications. Island Press.
Murphy KP (2012) Machine learning: a probabilistic perspective. The MIT Press.
Olden JD, Lawler JJ, Poff NL (2008) Machine learning methods without tears: a primer for ecologists. The Quarterly review of biology, 83, 171–193.
Ousterhout JK (1998) Scripting: Higher level programming for the 21st century. Computer, 31, 23–30.
Pabinger S, Dander A, Fischer M, et al. (2014) A survey of tools for variant analysis of next-generation genome sequencing data. Briefings in bioinformatics, 15, 256–278.
Parsons RJ, Forrest S, Burks C (1993) Genetic algorithms for dna sequence assembly. In ISMB, pp. 310–318.
Pevzner P (2000) Computational molecular biology: an algorithmic approach, vol. 1. MIT press Cambridge.
Pevzner P, Shamir R, et al. (2009) Computing has changed biology - biology education must catch up. Science, 325, 541.


Pevzner PA, Waterman MS (1995) Open combinatorial problems in computational molecular biology. In Theory of Computing and Systems, 1995. Proceedings., Third Israel Symposium on the, pp. 158–173. IEEE.
Quinlan JR (1986) Induction of decision trees. Machine learning, 1, 81–106.
Rackspace (2014) Rackspace Public Cloud. http://www.rackspace.com/cloud/.
Ricci V (2005) Fitting distributions with r. Contributed Documentation available on CRAN.
Rivals E, Salmela L, Kiiskinen P, Kalsi P, Tarhio J (2009) Mpscan: fast localisation of multiple reads in genomes. In Algorithms in Bioinformatics, pp. 246–260. Springer.
Romanycia MH, Pelletier FJ (1985) What is a heuristic? Computational Intelligence, 1, 47–58.
Sarwar B, Karypis G, Konstan J, Riedl J (2001) Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web, pp. 285–295. ACM.
Siragusa E, Weese D, Reinert K (2013) Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic acids research, 41, e78–e78.
Smith TF, Waterman MS (1981) Identification of common molecular subsequences. Journal of molecular biology, 147, 195–197.
Stajich JE, Block D, Boulez K, et al. (2002) The bioperl toolkit: Perl modules for the life sciences. Genome research, 12, 1611–1618.
Steel M (2005) Phylogenetic diversity and the greedy algorithm. Systematic Biology, 54, 527–529.
Valle D, Berdanier A (2012) Computer programming skills for environmental sciences. Bulletin of the Ecological Society of America, 93, 373–389.
Via M, Gignoux C, Burchard EGG (2010) The 1000 Genomes Project: new opportunities for research and social challenges. Genome medicine, 2, 3+.
Wilkins-Diehr N, Gannon D, Klimeck G, Oster S, Pamidighantam S (2008) Teragrid science gateways and their impact on science. Computer, 41, 32–41.
Zhang Z, Schwartz S, Wagner L, Miller W (2000) A greedy algorithm for aligning dna sequences. Journal of Computational biology, 7, 203–214.
