Computers in Biology and Medicine 43 (2013) 2028–2035


Computers in Biology and Medicine journal homepage: www.elsevier.com/locate/cbm

Accelerating in silico research with workflows: A lesson in Simplicity

Paul Walsh a, John Carroll a, Roy D. Sleator b,*

a nSilico LifeSciences, Ltd., Melbourne Building, Bishopstown, Cork, Ireland
b Department of Biological Sciences, Cork Institute of Technology, Rossa Avenue, Bishopstown, Cork, Ireland

Article info

Article history: Received 3 December 2012 Accepted 12 September 2013

Keywords: Bioinformatics Computational biology workflow annotation Usability Genomics Simplicity

Abstract

Bioinformatics is the application of computer science and related disciplines to the field of molecular biology. While there are currently several web based and desktop tools available for biologists to perform routine bioinformatics tasks, these tools often require users to manually and repeatedly co-ordinate multiple applications before reaching a result. In an effort to reduce time and error, workflow tools have been developed to automate these tasks. However, many of these tools require expert knowledge of the techniques and supporting databases which more often than not lies outside the scope of most biologists. Herein, we describe the development of sequence information management platform (Simplicity), a workflow-based bioinformatics management tool, which allows non-bioinformaticians to rapidly annotate large amounts of DNA and protein sequence data. © 2013 Elsevier Ltd. All rights reserved.

1. Background

1.1. Introduction

Some of the most common goals in molecular biology, particularly in the post-genomics era, are to quickly and accurately identify genes in a genome/metagenome [1], to ascribe a putative role for each gene [2], to determine the structure of the encoded protein, and ultimately to ascertain the function of the predicted protein [3], often leading to the discovery and development of new or improved therapeutics. Bioinformatics has become a popular alternative to expensive and laborious wet lab activities, allowing biologists to quickly form a testable hypothesis about what a protein may be and/or what role it likely plays [4]. However, bioinformatics tasks often require biologists to manually and repeatedly co-ordinate multiple tools to produce a result, as outlined in Fig. 1. Data transfer between applications is either by manual cut-and-paste or, in more advanced cases, by ‘screen-scraping’ web pages using scripting languages like Perl, often with additional data ‘massaging’ (e.g., small alterations in formatting, selections of subsets, and simple local transformations such as DNA-to-protein translation) [5,6]. Furthermore, a majority of contemporary bioinformatics tools fail to provide a reliable record – results from one website are simply copied and pasted into another, with no record of important parameters such as algorithm settings, time stamps or database versions used [7].

* Corresponding author. Tel.: +353 21 4335405; fax: +353 21 4326851. E-mail address: [email protected] (R.D. Sleator).

0010-4825/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.compbiomed.2013.09.011

Typically, the first step in a program of research performed by a biologist is to use a gene prediction tool to find the open reading frames (ORFs) in a genome/metagenome [2]. Once the ORFs are determined, the biologist must perform a sequence similarity check for each ORF found [8]. A sequence similarity tool compares the sequence being analysed against a database of known DNA or protein sequences (such as GenBank for DNA or the UniProtKB/Swiss-Prot protein database). One of the most popular sequence similarity tools is BLASTX [9]. The biologist must copy each ORF into BLASTX and select the databases to search and the scoring matrix or algorithm to use, a process which must be repeated for every ORF found. Some protein functions can be quickly identified due to their high sequence similarity to proteins whose function has already been identified and experimentally proven in a ‘wet’ lab – a process known as homology based transfer. A high sequence similarity suggests that the protein sequence being analysed (the query sequence) is similar to a known sequence (the subject), and since structure informs function, a putative function can be ascribed to the query sequence [10]. ORFs whose functions are not identified by sequence similarity against the primary databases must undergo further analysis using various methods in order to determine the function of the protein being analysed. Motif searching (patterns and profiles) using tools such as Prosite will help to identify highly conserved signature sequences which may provide clues as to the protein’s evolutionary origins [11]. If no sequence homologies exist, genomic context or expression based systems such as Phydbac2 may be used. Furthermore, at least in the case of proteins, structure based approaches such as FATCAT, VAST and FAST for full 3D structure or PROCAT for 3D structure motifs can be used [10].

In essence, the task of ascribing a function to each gene in the genome/metagenome involves a multistep workflow which unnecessarily ties up the biologist – distracting them from the wet lab experimentation for which they are properly trained [12].

1.2. Motivation - The need for Bioinformatics Management Tools

Workflows, such as the one described above, can be labour intensive, error prone and untraceable, and often result in the generation of significant amounts of data which the biologist must organise themselves [13,14]. Annotating sequence sets manually using the available online tools can therefore be quite labour intensive and, depending on the genome/metagenome size, may take several hundred man hours to complete [15]. Early attempts at automating bioinformatics tasks included using screen scraping scripts to create pipelines [16]; however, Hypertext Markup Language (HTML) based bioinformatics web interfaces were for the most part designed to be used by humans, not scripts. This technique proved troublesome as web pages are occasionally redesigned, forcing the programmer to rewrite the script to enable it to work on the new web page. Furthermore, most early bioinformatics workflow tools were developed for specialist bioinformaticians, often based on a UNIX platform using command line software [17]. While this approach remains popular, it is often a difficult transition for non-computer savvy biologists venturing into the bioinformatics arena for the first time and, as such, requires a significant time investment. During the late 1990s and early 2000s many bioinformatics workflow developers turned to developing graphical user interface (GUI) applications, while others began developing web based tools using recently developed Web 2.0 technologies [18]. These applications, or tools, allow researchers to visually build a workflow by selecting several components, each of which performs a separate task in the workflow. Complex and powerful workflows can be created, saving the researcher both time and effort.

1.3. Bioinformatics Workflow Tools

Several bioinformatics workflow tools, both open source and commercial, are currently available (Table 1). While some are desktop applications that employ local resources and web services for workflow execution in the application, there are also web based workflow tools built using Web 2.0 technologies. In this case the workflow execution is performed on a server while a browser is used to create workflows and to review workflow results. Relatively few rich internet application (RIA) bioinformatics workflow tools have been developed, including Microsoft’s general workflow tool, Popfly (which has been discontinued), and Calvin [19]. RIAs are embedded in an internet browser and behave similarly to desktop applications, enabling sophisticated user interactions, client-side processing, asynchronous communications, and multimedia [20].

In general, existing bioinformatics workflow tools can be grouped into one of three categories: those designed for (i) the expert bioinformatician, (ii) biologists with some bioinformatics expertise and (iii) biologists with little or no bioinformatics acumen. Both Taverna and BioBike are aimed at the first group: users experienced in writing workflow software in Perl, Python or Lisp, yet desirous of an easier and faster way to create workflows without the need to program. This group would be aware of the web services available, the different data formats used and how results need to be transformed to suit the input of another web service. Galaxy and Ugene are aimed at the second group: users who have experience using bioinformatics tools but don’t have the technical knowledge to write workflows in Perl, Python or other relevant languages. These users are aware of the different online tools available and of some of the data formats but would rather use an automated tool to handle the data transformation. Finally, GenomeQuest, BioExtract and WebLab are aimed at all levels of expertise, from the novice to the expert user. GenomeQuest allows bioinformaticians to write workflows in a scripting language called Smarty, while biologists who have no experience of writing workflows can simply request GenomeQuest to generate the workflows (see Table 2 for an overview of each of the seven workflow tools mentioned above).

Fig. 1. Overview of the typical workflow which a biologist might use to predict the function of a protein (modified from [27]).

2. Simplicity architecture

Simplicity is a bespoke bioinformatics management system that allows biologists, with little or no computing background, to manage and analyse information generated from large scale genomic/metagenomic sequencing projects [1]. Currently Simplicity incorporates a number of the most common bioinformatics tools, including Gene Prediction (Glimmer, EMBOSS GetOrf), Similarity Searching (EBI NCBI Blastp, RCSB PDB Blast Search, InterProScan, Pfam and CATH), Multiple Sequence Alignment (ClustalW2), Phylogenetics (PHYLIP fProtdist, PHYLIP fNeighbor, PHYLIP fproml and PHYLIP fprotpars), Primer prediction (EPRIMER3) and Genome Mapping (Gview). More tools are currently being added to meet individual user requirements as well as being tailored for specific projects.

Simplicity was developed using an evolutionary prototyping software development model [21]. This approach implements only confirmed requirements from biologists. Evolutionary prototyping involves implementing well understood requirements in a rigorous fashion and writing the code in a way that is easily modifiable. The prototype then evolves as unknown requirements are discovered during the development process through the use of cognitive walkthroughs and user feedback. In consultation with over 120 biologists, a user-centred development approach was employed to design and develop the prototype. Users were involved at the beginning for requirements gathering, in the GUI design and finally in the user testing phase of the project. Various requirements gathering techniques were employed, including interviews, focus groups and surveys, to elicit information from users. These techniques drew information from users about the tools they use as well as their attitudes towards these tools and how result presentation could be improved upon. A competitor analysis of bioinformatics tools (listed in Table 2), in conjunction with our tool (Simplicity), was undertaken to assess how these tools allowed users to create and run workflows and review workflow results (Table 3).

The GUI was implemented in a declarative language called XAML, the code behind the GUI being implemented using C#. GUI design was performed in line with Nielsen's 10 Usability Heuristics [22], which help to avoid common usability problems. Nielsen's guidelines include the need to allow user control and freedom, the use of consistency and standards, error prevention, flexibility and efficiency of use, aesthetic and minimalist design, and helping users recognise, diagnose, and recover from errors.

Table 1 The most commonly used tools for workflow design and execution.

Desktop applications | Web based tools
Taverna – www.Taverna.org.uk | WebLab – http://weblab.cbi.pku.edu.cn
Ugene – http://ugene.unipro.ru | Bioextract – http://bioextract.org
Wildfire – http://wildfire.bii.a-star.edu.sg/index.php | Galaxy – http://galaxy.psu.edu
Triana – www.trianacode.org | Biobike – www.biobike.org
Kepler – http://kepler-project.org | Ergatis – http://ergatis.sourceforge.net
Pipeline Pilot – http://accelrys.com/products/pipeline-pilot (commercial) | Genomequest – www.genomequest.com (commercial)

Table 2 Overview of the most popular bioinformatics workflow tools in common use.

Tool name | Implementation
WebLab | Developed using Java and uses Apache Tomcat as container for Java Servlets/JSP and MySQL (http://www.mysql.com) as backend database system to store user data and other information. Graphviz (http://www.graphviz.org) is used to produce figures/graphs and Lucene (http://lucene.apache.org) as an information retrieval library to build, index and search information.
Taverna | Developed using Java and uses a heavily modified version of Freefluo (http://freefluo.sourceforge.net/) to orchestrate web services. Taverna incorporates many publicly available web service packages such as Soaplab, Biomart, Kegg, EMBOSS, Biomoby and web services from EBI, NCBI and many other providers. Classes/components called ‘shims’ are used to transform data.
Galaxy | The Galaxy core components and operation libraries are written in C. User history and a limited amount of results per user are stored in a MySQL database. The primary format that Galaxy uses to store query results is the Browser Extensible Data (BED) format which is used at University of California, Santa Cruz (UCSC) for the Genome Browser.
BioBike | BioBike is built using the Lisp language. Behind the scenes, expressions created by the user are translated into Lisp and compiled, yielding code that runs at a speed comparable to that of C code.
BioExtract | BioExtract Server workflows are executed using Sun's Java Message Service (JMS). The JMS allows application components based on the Java 2 Platform Enterprise Edition (J2EE) to create, send, receive, and read asynchronous messages between components. Workflow instructions and results are saved in a MySQL database.
Ugene | UGENE is written with the open-source Qt4 C++ multi-platform library and the QtScript scripting language. Ugene has a core component and makes use of plug-ins to add new features.
GenomeQuest | Workflows are created by phoning/emailing GenomeQuest, where a workflow is created by the company for the user. Users can create their own workflows if GenomeQuest has been installed at their company/institute, but they need to have knowledge of the Unix operating system, the web development scripting language PHP, the web template system Smarty and GenomeQuest's application programming interface (API).

Table 3 Competitor analysis of the most popular bioinformatics workflow tools.

Feature examined | Simplicity | Galaxy | Taverna | BioBike | BioExtract | GenomeQuest (C) | Ugene
Can workflows be shared with other users? | No | Yes | Yes | Yes | Yes | Yes | Yes
Are workflow results stored? | Yes | Yes | Yes | No | Yes | Yes | Yes
Can results be shared with other users? | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Are there advanced settings for each task e.g. change matrix or gap costs? | Yes | Yes | No | No | Yes | Yes | Yes
Is there traceability between tasks (data provenance)? | Yes | Yes | No | No | Yes | No | No
Is there data encryption over the internet (https)? | No | No | No | No | No | Yes | No
Can a user add a web service to the tool? | No | No | Yes | No | No | Yes | No
Is there documentation on tool use? | Yes | No | Yes | Yes | Yes | Yes | Yes
Are there tutorials or videos on how to use tool? | Yes | Yes | Yes | Yes | Yes | Yes | No
Is there online support e.g. forum, wiki? | Yes | No | Yes | Yes | No | Yes | Yes

Fig. 2. A high level overview of the Simplicity architecture. The client is used to create, run and view workflows. The rich internet application (RIA) services simplify communication between client and server. A workflow engine executes workflows. Web service enactors, one for each web service, manage the web service calls, save results and manage errors. The web services switch calls the appropriate web service enactor when needed.

The architecture is divided into several main components (illustrated in Fig. 2), each performing a specific function. The Simplicity client (or browser) allows the user to create and run workflows, run previously created workflows (called templates), and view workflow results. The client communicates with the server via rich internet application (RIA) services, which provide application logic to control access to data using queries, as well as custom operations and user authentication. The advantage of RIA is that much of the client side work is done by the embedded plug-in and only the data is downloaded when needed from the server, freeing the server to do other work. Simplicity uses the Microsoft Silverlight RIA framework (with the Microsoft Visual Studio 2010 Integrated Development Environment [IDE] used to write the C#, XAML and XML code). The ability of Silverlight to have multi-threading on the client side allows for complex user interactions which are not available in other client side plug-ins. Silverlight also gives users the option of using the plug-in in the browser or as an application on the desktop; both options see the plug-in act as a client.

Furthermore, the majority of bioinformatics tools are developed for the Unix environment. The Simplicity server side infrastructure therefore includes an Ubuntu server where bioinformatics tools can be installed. The Windows server manages the workflow and offloads data to be analysed to the Ubuntu server via web services. The Silverlight plug-in is updated to include a form for each installed tool. All data from and to the client goes through RIA services.

The workflow engine looks at each component or tool in the workflow and checks to see if there are any parallel splits and, if so, creates a new thread to handle each split. Where there is a sequence of tools or components in a workflow, the engine waits until all the work is done for a tool before calling the next tool in sequence. The workflow engine passes the workload of a tool on to the web service switch, which then decides which web service enactor to call and forwards the work on to the enactor. The web service switch is implemented using a factory pattern; its purpose is solely to create objects. The enactors manage all the work to be done by each web service, e.g. if there are 150 separate BLAST searches to perform, the enactor will, in sequence, request a web service to do the work, wait for the work to be done and then store the result. Each enactor inherits the functionality of a database access class, giving the enactor access to all the database tables. The database is used to store all data about a workflow and also stores all the results a workflow produces.
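The interplay between the workflow engine, the web service switch and the enactors can be illustrated with a minimal sketch. Simplicity itself is written in C# on Silverlight; the Python below is only an illustration under stated assumptions, not the authors' implementation. All class and function names (DatabaseAccess, Enactor, WebServiceSwitch, run_workflow) are hypothetical, the web service calls are replaced by fake string results, a shared list stands in for the relational database, and the choice to wait for all branches of a parallel split before continuing is an assumption.

```python
import threading
from abc import ABC, abstractmethod


class DatabaseAccess:
    """Stand-in for the database access class; a shared list plays the database."""

    def __init__(self, table, lock):
        self._table = table
        self._lock = lock

    def store_result(self, tool, item, output):
        with self._lock:  # enactors on different threads share one table
            self._table.append((tool, item, output))


class Enactor(DatabaseAccess, ABC):
    """Manages all the work for one web service and stores every result."""

    name = "abstract"

    @abstractmethod
    def call_service(self, item):
        """Placeholder for a blocking request to the real web service."""

    def run(self, items):
        outputs = []
        for item in items:  # in sequence: request, wait for the answer, store
            output = self.call_service(item)
            self.store_result(self.name, item, output)
            outputs.append(output)
        return outputs


class GetOrfEnactor(Enactor):
    name = "getorf"

    def call_service(self, item):
        return f"orf({item})"  # fake result, no network call


class BlastpEnactor(Enactor):
    name = "blastp"

    def call_service(self, item):
        return f"blast({item})"  # fake result, no network call


class InterProScanEnactor(Enactor):
    name = "interproscan"

    def call_service(self, item):
        return f"ipr({item})"  # fake result, no network call


# Hypothetical registry mapping tool names to enactor classes.
ENACTORS = {cls.name: cls for cls in (GetOrfEnactor, BlastpEnactor, InterProScanEnactor)}


class WebServiceSwitch:
    """Factory: its only job is to create the enactor for a given tool."""

    def __init__(self, table, lock):
        self._table = table
        self._lock = lock

    def create(self, tool):
        return ENACTORS[tool](self._table, self._lock)


def run_workflow(steps, items, switch):
    """Run steps in order. A string is a tool; a list of branches is a parallel split."""
    for step in steps:
        if isinstance(step, str):
            # Sequential tool: all its work finishes before the next step starts.
            items = switch.create(step).run(items)
        else:
            # Parallel split: one new thread per branch (joined here by assumption).
            threads = [
                threading.Thread(target=run_workflow, args=(branch, items, switch))
                for branch in step
            ]
            for thread in threads:
                thread.start()
            for thread in threads:
                thread.join()
    return items


results_table, results_lock = [], threading.Lock()
workflow = ["getorf", [["blastp"], ["interproscan"]]]
run_workflow(workflow, ["seq1", "seq2"], WebServiceSwitch(results_table, results_lock))
print(len(results_table), "results stored")
```

In this sketch the engine never names a concrete enactor class; it only asks the switch for one, which is the point of the factory pattern described above. Adding a tool would mean adding one enactor class and one registry entry.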

3. Simplicity use and evaluation

3.1. Creating a workflow in Simplicity

To create a workflow the user clicks on the “Create Workflow” button on the home page. The tools, tool locations in the workflow area, settings and input for each tool are stored in the client until the user wishes to save and run a workflow. The tool menu is presented to the user as a treeview structure; as the user clicks on the menu it expands to show the tools as leaf nodes in the tree structure. Each tool in the prototype has its own treeview structure showing what tools are available. When a workflow is being created, tools (represented as boxes) are linked by directional lines. A line represents the fact that results from one tool become the input for another tool; data always flows from one tool to another in one direction and can never go backwards. The create workflow webpage shows users a menu of the available tools; when a tool is clicked in the list it is added to the workflow area and the menu changes to show which tools are compatible with the last tool added. Clicking on any tool in the workflow area will also show the compatible tools in the menu for that tool. This allows users to build workflows quickly, removing the need to browse the menu, a significant advantage for novice users who may not know what tools are compatible.

Tool settings and UI data are saved as a string, from which a GUI is generated when needed. This approach allows changes to the settings data or the UI while the client is running, rather than stopping the program and changing the GUI using Visual Studio and then rebuilding and redeploying the software. When a settings GUI has been created and displayed by the client, the user must input data and select the desired tool settings; both the UI data and tool settings data must then be saved. This is done by serialising the tool settings GUI and creating a new string with the tool settings and UI data. The settings data is extracted and stored in an XML format. The classes used to do this in all three web pages presenting a workflow are shown in Fig. 3.

Fig. 3. Part of the client class diagram used in the create workflow web page of Simplicity. CreateWorkFlow is the create workflow webpage including all the UI elements, the event handlers for buttons and other UI elements, and all the programming code behind the GUI. The webpage GUI is described using XAML and is stored in a separate file to the programming code. Both the XAML and programming code are part of the one class. This class has list containers for storing ComponentInfo and LoadedXamlInfo. This class is where the XAML mark-up for tool settings is read and instantiated as a GUI. The event handlers respond to the user creating a popup page for settings, a tooltip for help and information, or sending workflow information to the server and running a workflow. The ComponentInfo class stores tool details such as name, web service id, XAML, connected to, connected from, a list of LineInfo objects used to connect this component to another component and the X and Y coordinates of the ComponentInfo item in the workflow area. These details map directly to the Component_launched and template_component tables in the database. LineInfo is used to store details about the line such as where it is connecting from, connecting to and origin tool id. DrawLineDelegate is called when a line is moved in the workflow. When a tool is moved in the workflow area all the attached lines are also moved. The line position, length and angle must be worked out for each moving line, which is then redrawn to give the illusion that a line is anchored to a tool as it is moved around the workflow area. The TreeMenuUpdate class is responsible for updating the tools menu. It contains code for several treeview UI elements and, depending on the tool clicked by the user, will show the appropriate treeview for that tool. The treeview for a tool shows the compatible tools the user can click on and add to the workflow area.

3.2. Reviewing results in Simplicity

The user will receive an email from Simplicity informing them when their workflow has completed. Clicking on the results link in the main menu loads the results listing page from which the user selects the workflow to review. Fig. 4a shows the workflow results page for the analysis of the Ardmore phage genome [23], analysed using three tools: EMBOSS GetOrf, EBI Blastp and InterProScan. Clicking the icon for each tool in the workflow displays a child page showing the results for the web service; Fig. 4b shows the results for EBI BLASTP, which consist of three different areas: Blast results hit list, Blast hits description and Blast alignments. The results for EMBOSS GetOrf (Fig. 4c) are returned in a plain text format and are presented to the user in a richtextbox. The RCSB PDB results (Fig. 4d) are retrieved from the database in an XML format; the data is parsed and inserted into a datagrid. The picture is downloaded directly from PDB by the client by creating a URL to the picture and inserting it into an image control which is added to the datagrid. Each PDB record has a name, and clicking on the name in the results list will retrieve the result. The expert caption (viewed by clicking on the icon) gives the parameters/settings used by a web service to get a result. Fig. 4e shows the expert caption for EBI BLASTP. The expert caption can be copied by clicking on the ‘Copy Caption’ button. The caption can then be pasted into any word or text document and can be used in publishable content. Furthermore, full details of the workflow, including workflow name, description and summary data, can be combined to create a publishable paper or report detailing the results of a particular sequence analysis. When the user clicks on the ‘generate report’ button a request is sent to the server via RIA services to generate a report. A message box tells the user that the paper is being generated and that a URL link will be emailed to the user when it is ready to download. A class on the server side retrieves the data for the workflow, tool settings and result summaries and lays out the paper. Once the paper has been generated and saved as a PDF file to a specific location on the server, an email is sent to the user with a link to download the paper.

Fig. 4. (a) Overview of a typical workflow involving three tools: EMBOSS GetOrf, EBI BLASTP and InterProScan, used to analyse the Ardmore bacteriophage genome. (b) The results for EMBOSS GetOrf are returned in a plain text format and are presented to the user in a richtextbox from which the data can be highlighted and copied. (c) The results for EBI BLASTP, consisting of three parts. The Blast results list: links to all the Blast reports for this tool are listed under the name of the query sequence; clicking on a name loads the Blast report for a particular sequence. The Blast hits description: gives the main Blast details for the selected Blast report. The Blast alignments: all Blast details are displayed for each hit in the Blast alignment for the selected Blast report. (d) The RCSB PDB results are retrieved from the database in an XML format; the data is then parsed and inserted into a datagrid. The 3D protein structure image is downloaded directly from PDB by the client creating a URL to the image and inserting it into an image control which is added to the datagrid. Each PDB record has a name, and clicking on the name in the results list will retrieve the result. Users can quickly browse through the results by clicking on the PDB Blast hit names. The expert caption can be viewed by clicking on the icon in the EMBOSS GetOrf, EBI BLASTP or InterProScan tools in the workflow results page. The expert caption gives in paragraph form the parameters/settings used by a web service to get a result. (e) The expert caption for EBI BLASTP.

3.3. Simplicity evaluation

A user-centred approach was taken to the design, development and testing of the prototype tool. Usability evaluation of Simplicity involved a combination of inspection and testing methods. Inspection methods, such as cognitive walkthroughs and heuristic analysis, were used to check the interface design by following established standards, such as Nielsen's Usability Heuristics [22]. The cognitive walkthrough is an evaluation technique for a task-oriented walkthrough of a user interface. The aims of the cognitive walkthrough were to discover how easy the system was to use and learn, discover confusing actions or tasks, find out if there was enough information for the user to make the next correct step in the task, assess if a task was completed efficiently and discover if anything was missing. The cognitive walkthrough was performed with prototype screenshots of the software. The study involved eight participants, none of whom had seen the software before. A series of images was used to simulate a user performing a task with the software. The participants were given 5–10 s to work out what to do in each image given the task at hand and the information presented in the image, before the moderator explained what to do in the image. Usability issues were discovered when participants did not understand what happened or expected something else to happen in the image.

The goal of the heuristic evaluation was to find usability problems in the user interface design and to resolve these problems as part of an iterative design process. The process involved eight evaluators of the Simplicity interface, using Nielsen's 10 usability principles (referred to as ‘heuristics’) as a guide.

Usability or field testing was used to examine how ‘usable’ the software was, thereby helping to identify potential problems which may have been missed during the development phase. The advantage of usability testing is that it provides direct information about how users interact with the system as well as their opinion of the user interface [24]. In the current study usability testing was carried out using a combination of field testing with a group (eight participants, all biologists with varying levels of bioinformatics experience) followed by questionnaires. To provide structure and consistency to the field test, a task script (including common tasks typically performed with the software) was created for the users to complete. This script consisted of six tasks; one task was to register on the Simplicity website, allowing users to use the workflow tool, while the remaining tasks were to build workflows. CamStudio video capturing software (http://camstudio.org/) was used to record all screen activity on each user's desktop machine. After the usability testing was completed the videos were viewed by a usability expert who noted any problems arising. Immediately following usability testing, users were required to complete a questionnaire to obtain their opinions about the software, the problems they encountered and features that were missing or could have been improved. A significant advantage of this approach is that it avoids interviewer bias arising from verbal or visual cues that can influence a respondent to answer in a particular way, thus affecting the validity and reliability of the data collection [25].

The usability techniques described above highlighted a number of usability problems, most of which have been, or are in the process of being, resolved. The initial GUI design concept is very different to the developed prototype; user feedback gathered from prototyping and cognitive reviews has been a significant driving force in the development of the software.
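Returning to the workflow-building behaviour described in Section 3.1, the following sketch may help make it concrete: tools are joined by directional lines, the menu only offers tools compatible with a chosen tool, and settings are stored as XML. It is written in Python purely for illustration (the Simplicity client is C#/XAML) and is not the authors' code. The COMPATIBLE table, the Workflow class, the settings_to_xml function and the XML element names are all hypothetical, the compatibility pairs are invented for the example, and each tool is assumed to appear at most once in a workflow.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical compatibility table: tool -> tools that can consume its output.
COMPATIBLE = {
    "getorf": ["blastp", "interproscan"],
    "blastp": ["clustalw2"],
    "interproscan": [],
    "clustalw2": [],
}


@dataclass
class Workflow:
    tools: list = field(default_factory=list)
    lines: list = field(default_factory=list)  # directional (from_tool, to_tool) pairs

    def menu_for(self, tool=None):
        """Tools to show in the menu: everything for an empty workflow, otherwise
        only those compatible with `tool` (default: the last tool added)."""
        if not self.tools:
            return sorted(COMPATIBLE)
        return list(COMPATIBLE[tool or self.tools[-1]])

    def add_tool(self, tool, after=None):
        """Add a tool and draw a line from `after` (default: the last tool added)."""
        if tool not in COMPATIBLE:
            raise ValueError(f"unknown tool: {tool}")
        if tool in self.tools:
            # Simplification: one box per tool, so data can never flow backwards.
            raise ValueError(f"{tool} is already in the workflow")
        if self.tools:
            source = after or self.tools[-1]
            if source not in self.tools:
                raise ValueError(f"{source} is not in the workflow")
            if tool not in COMPATIBLE[source]:
                raise ValueError(f"{tool} cannot take input from {source}")
            self.lines.append((source, tool))
        self.tools.append(tool)


def settings_to_xml(tool, settings):
    """Serialise tool settings to an XML string (element names are assumptions)."""
    root = ET.Element("settings", tool=tool)
    for key, value in settings.items():
        ET.SubElement(root, "param", name=key).text = str(value)
    return ET.tostring(root, encoding="unicode")


wf = Workflow()
wf.add_tool("getorf")
print(wf.menu_for())                       # tools compatible with getorf
wf.add_tool("blastp")
wf.add_tool("interproscan", after="getorf")
print(wf.lines)
print(settings_to_xml("blastp", {"matrix": "BLOSUM62"}))
```

Restricting the menu to compatible tools means an invalid connection is never offered in the first place, which matches the error-prevention heuristic used in the GUI design.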

4. Discussion and Conclusion While several competitive workflow tools allow autotransformation of data, Simplicity's data presentation is unique. While many tools treat each result as a file or as an independent database item requiring the user to open files or windows to view results, Simplicity stores the result in a relational database so data from multiple Blast reports for example can be presented in a single, easy to use, window. This allows the user to browse the results quickly and efficiently. The report generation and expert caption feature of Simplicity (which provides publishable free text descriptors of figures and tables generated in the results) is also unique to Simplicity, helping the user to present the data in a readily publishable format. In line with the observations of Barker and Hemert [26], that tools need to be tailored for domain users rather than being built by computer scientists for computer scientists, Simplicity was specifically tailored for biologists (rather than for bioinformaticians) and as such, this target group was involved from the very early stages in the product's development, including interviews, focus groups, cognitive walkthrough in the GUI design stage and user testing of the application. In several competing workflow tools the transformation process must be added as an element of the workflow. While this approach gives the bioinformatician flexibility when creating workflows, it requires the user to have an understanding of the data formats and the transformation required between different tools, which is not always the case for biologists. With Simplicity users can input data without having to worry about transforming data for subsequent tools. The fact that the Simplicity has been designed and developed specifically to meet the needs of biologists means that for more advanced users it may lack flexibility, especially when compared to tools like Taverna or Kepler. 
Furthermore, in its existing form additional tools can only be added by the developers, and workflows cannot currently be shared between users. With respect to software implementation, while a user can have multiple workflows running at the same time, there is an upper limit dictated by the fact that Simplicity currently depends on third-party web services, many of which have fair usage policies that limit access above a certain level of resource usage. EBI, for example, allows a maximum of 20 concurrent requests from a single IP address. Given that a server at one IP address could be serving many users, this is a significant bottleneck which needs to be addressed. File size is also a limiting factor: at present Simplicity cannot handle files at or above the gigabyte scale. We are, however, currently reconfiguring Simplicity to accept much larger files such as those held at The Cancer Genome Atlas (TCGA).

Further development is also ongoing, with key short-term goals including the establishment of a cross-platform client allowing greater flexibility and collaborative potential. Expanding on the report generator, we plan to include a patent search feature, allowing users to search patent databases such as the Thomson Reuters patent database, coupled with an automated patent generator, i.e. a feature which can automatically generate a patent document for each newly identified gene or protein. While workflows and reports are currently not tagged with a DOI, future plans include enabling workflows to be linked to external forums such as http://www.myexperiment.org/workflows, where workflows across different platforms (e.g. Taverna) can be shared and downloaded.

Simplicity is a bespoke bioinformatics management system that allows biologists to manage and analyse information generated from large-scale ‘omic’ projects, facilitating quick transition from ‘project-to-publication’ or ‘lab-to-licence’. We engaged with over 120 professional researchers across the spectrum of biological sciences (using a qualitative analysis based approach involving both focus groups and online surveys) to develop a software framework that meets academic and industry demands. Simplicity has been developed as a cloud-based Software as a Service (SaaS) solution, allowing for rapid deployment of extended features in response to increasing user needs. This is made possible by the managed, secure and scalable online infrastructure, which reduces both implementation costs and risk while providing maximum flexibility to the customer.
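The concurrency ceiling discussed above (a maximum of 20 concurrent requests from a single IP address) can be made concrete with a short sketch. It shows one common way a shared server could keep the jobs of many users under such a cap, using a counting semaphore. This is an assumption-laden illustration rather than a description of how Simplicity is implemented: `ThrottledClient`, `run_jobs` and the `call_service` callback are hypothetical names, and no real web service is contacted.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 20  # per-IP cap quoted in the text for EBI


class ThrottledClient:
    """Wrap a service call so that at most `limit` calls run at once.

    `call_service` is a hypothetical stand-in for a request to a
    third-party web service.
    """

    def __init__(self, call_service, limit=MAX_CONCURRENT):
        self._call_service = call_service
        self._slots = threading.BoundedSemaphore(limit)

    def submit(self, job):
        # Block until one of the `limit` slots is free, then call out.
        with self._slots:
            return self._call_service(job)


def run_jobs(client, jobs, workers=50):
    """Run jobs from many workflows through a single throttled client."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the input order of `jobs` in its results.
        return list(pool.map(client.submit, jobs))
```

With this arrangement the number of worker threads can exceed the cap, but excess requests simply wait for a free slot. The cap is respected, yet the bottleneck itself remains: total throughput for all users sharing the IP address is still bounded by the 20 available slots.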

Conflict of interest statement

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

Acknowledgements

RDS is an ESCMID Research Fellow and Coordinator of the EU FP7 IAPP project ClouDx-i. PW is nSilico’s chief executive and a PI on ClouDx-i.

References

[1] R.D. Sleator, C. Shortall, C. Hill, Metagenomics, Lett. Appl. Microbiol. 47 (2008) 361–366.
[2] R.D. Sleator, An overview of the current status of eukaryote gene prediction strategies, Gene 461 (2010) 1–4.
[3] R.D. Sleator, Proteins: form and function, Bioeng. Bugs 3 (2012) 80–85.
[4] T. Manning, R.D. Sleator, P. Walsh, Naturally selecting solutions: the use of genetic algorithms in bioinformatics, Bioengineered 4 (2012) 266–278.
[5] P. Lord, S. Bechhofer, M.D. Wilkinson, et al., Applying semantic web services to bioinformatics: experiences gained, lessons learnt, in: Proceedings of the Semantic Web - ISWC 2004, vol. 3298, 2004, pp. 350–364.
[6] D. Hull, K. Wolstencroft, R. Stevens, et al., Taverna: a tool for building and running workflows of services, Nucleic Acids Res. 34 (2006) W729–W732.
[7] T.C. Hudson, A.E. Stapleton, J.L. Brown, Codifying bioinformatics processes without programming, Drug Discovery Today: BIOSILICO 2 (2004) 164–169.
[8] R.D. Sleator, A beginner's guide to phylogenetics, Microb. Ecol. 66 (2013) 1–4.
[9] M. Johnson, I. Zaretskaya, Y. Raytselis, Y. Merezhuk, S. McGinnis, T.L. Madden, NCBI BLAST: a better web interface, Nucleic Acids Res. 36 (2008) W5–W9.
[10] R.D. Sleator, P. Walsh, An overview of in silico protein function prediction, Arch. Microbiol. 192 (2010) 151–155.
[11] C.J. Sigrist, E. de Castro, L. Cerutti, et al., New and continuing developments at PROSITE, Nucleic Acids Res. 41 (2012) 344–347.
[12] S. Ghosh, Y. Matsuoka, Y. Asai, K.Y. Hsin, H. Kitano, Software for systems biology: from tools to integrated platforms, Nat. Rev. Genet. 12 (2011) 821–832.
[13] A. Tiwari, A.K. Sekhar, Workflow based framework for life science informatics, Comput. Biol. Chem. 31 (2007) 305–319.
[14] D. Hull, K. Wolstencroft, R. Stevens, et al., Taverna: a tool for building and running workflows of services, Nucleic Acids Res. 34 (2006) W729–W732.
[15] C. De Filippo, M. Ramazzotti, P. Fontana, D. Cavalieri, Bioinformatic approaches for functional annotation and pathway inference in metagenomics data, Brief. Bioinform. 13 (2012) 696–710.
[16] P.B.T. Neerincx, J.A.M. Leunissen, Evolution of web services in bioinformatics, Brief. Bioinform. 6 (2005) 178–188.
[17] S. Kumar, M. Nei, J. Dudley, K. Tamura, MEGA: a biologist-centric software for evolutionary analysis of DNA and protein sequences, Brief. Bioinform. 9 (2008) 299–306.
[18] Z. Zhang, K.H. Cheung, J.P. Townsend, Bringing Web 2.0 to bioinformatics, Brief. Bioinform. 10 (2009) 1–10.


[19] M. Held, W. Blochinger, M. Werning, E-biology workflows with Calvin, in: G. Vossen, D.E. Long, J. Yu (Eds.), Web Information Systems Engineering - WISE, vol. 5802, Springer, Berlin, Heidelberg, 2009, pp. 581–588.
[20] P. Fraternali, S. Comai, A. Bozzon, G.T. Carughi, Engineering rich internet applications with a model-driven approach, ACM Trans. Web 4 (2010) 1–47.
[21] A.M. Davis, Operational prototyping: a new development approach, IEEE Software 9 (1992) 70–78.
[22] J. Nielsen, Usability inspection methods, in: Proceedings of the Conference Companion on Human Factors in Computing Systems, ACM, Boston, Massachusetts, United States, 1994, pp. 413–414.
[23] M. Henry, O. O'Sullivan, R.D. Sleator, et al., In silico analysis of Ardmore, a novel mycobacteriophage isolated from soil, Gene 453 (2010) 9–23.
[24] A. Holzinger, Usability engineering methods for software developers, Commun. ACM 48 (2005) 71–74.
[25] R.D. Sleator, The evolution of eLearning background, blends and blackboard, Sci. Prog. 93 (2010) 319–334.
[26] A. Barker, J. Hemert, Scientific workflow: a survey and research directions, in: R. Wyrzykowski, J. Dongarra, K. Karczewski, J. Wasniewski (Eds.), Parallel Processing and Applied Mathematics, vol. 4967, Springer, Berlin, Heidelberg, 2008, pp. 746–753.
[27] R.D. Sleator, P. Walsh, An overview of in silico protein function prediction, Arch. Microbiol. 192 (2010) 151–155.

Roy D. Sleator graduated from University College Cork with a BSc in Microbiology, an MA in Education and a PhD in Molecular Biology, and holds a PGCert in Bioinformatics from The University of Manchester, UK. In 2006 he was awarded the Society for Applied Microbiology WH Pierce Prize. Sleator is a lecturer at the Department of Biological Sciences and a PI at Cork Institute of Technology’s Centre for Research in Advanced Therapeutic Engineering (CREATE) and the Alimentary Pharmabiotic Centre (APC) at UCC. He is also founding Editor-in-Chief of the scientific journal Bioengineered, published by Landes Bioscience, Austin, Texas, USA.
