CHAPTER 19
Informatics for Molecular Biologists

INTRODUCTION

In the last few years, many areas of molecular biology have seen an accelerating shift in the role of bioinformatics, from episodic data processing at later stages of a project to deep and constant engagement. This engagement often starts with shaping the main hypothesis and experimental design and continues with iterative data analysis that guides experimental steps throughout the project. Bioinformatics is an interdisciplinary field powered by the “cross-pollination” of major biological problems and the quantitative approaches used to solve them. Not surprisingly, the field is extremely diverse, reflecting the variety of these problems, the types of data amenable to analysis, and the computational methods available. This chapter covers some of the most widely used resources and protocols in various areas of bioinformatics.

By scope, the units of this chapter can be broadly split into (a) primers on specific computational tools (BLAST, PAML, RAxML, PATIMDB) or wider analytical platforms (Galaxy, UCSC Genome Browser); (b) descriptions of publicly available data (protein databases, sequence databases, genomic datasets); (c) protocols describing broader workflows in a specific area that involve multiple state-of-the-art tools and datasets (analysis of protein sequences; analysis of microarray expression data); and (d) summaries of computational resources available in a major biological area (small regulatory RNAs, phylogenetic sequence analysis).

The units of the chapter cover several major fields. Protein and nucleic acid sequence analysis is one of the most prominent areas of bioinformatics and can be applied in almost any molecular biology project. A first step towards dissecting the functional features of a sequence of interest is the detection and analysis of its homologs (evolutionarily related sequences) in public sequence databases. The BLAST family of robust and well-trusted homology detection tools (UNIT 19.3) is based on the original BLAST (Basic Local Alignment Search Tool), first introduced in 1990, and is being further developed and maintained for public online use by the National Center for Biotechnology Information (NCBI).

The next step is the analysis of the family of detected homologs, including computational dissection of sequence patterns and signatures. This analysis often reveals important information about the evolution, structure, and function of the family as a whole, and of the gene or protein of interest in particular. PAML (UNIT 19.1) is an advanced method for detecting sequence signatures of adaptive evolution in protein-coding DNA sequences, whereas RAxML (UNIT 19.11) is a tool for the phylogenetic analysis of protein sequence families. UNIT 19.11 also provides a brief introduction to the field of protein phylogenetic analysis in general.
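
The homology detection step described above rests on local sequence alignment. As a rough illustration only, the following sketch computes a Smith-Waterman local alignment score for two short sequences. It is not BLAST itself, which relies on a faster seeded heuristic and on substitution matrices with affine gap penalties; the function name and the simple match, mismatch, and gap scores used here are assumptions chosen for readability.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score of a and b.

    Uses a linear gap penalty and a flat match/mismatch score
    (illustrative values, not those of any particular tool).
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for ch_a in a:
        curr = [0]  # first column is always 0 for local alignment
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    # Two sequences that share the segment "ACGT".
    print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

A higher score indicates a longer or better-conserved shared segment. Real homology searches additionally assess statistical significance, which this sketch does not attempt.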

Current Protocols in Molecular Biology 19.0.1-19.0.2, April 2014. Published online April 2014 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/0471142727.mb1900s106. Copyright © 2014 John Wiley & Sons, Inc.

An important related topic is the variety of available sequence databases, their value in specific aspects of computational analysis, and the ways that researchers can submit their own data (UNIT 19.2). UNIT 19.4 is an introduction to a broader range of public dataset types that focus specifically on proteins and include protein sequence, structure, and classification by families, superfamilies, and structural folds, as well as protein modifications, interactions, intracellular localization, etc. As a guide to the combined application of various resources for protein sequence analysis, UNIT 19.5 provides a hands-on introduction to basic workflows that can generate new biologically meaningful information about the protein of interest: detecting homologs in public databases, identifying protein domains, predicting secondary structure, functional motifs, transmembrane regions, etc.

UNIT 19.6 is another hands-on overview of basic workflows in a different widely used methodological field, the analysis of microarray gene expression data. This unit leads the reader through the various stages of the process: experimental design, preprocessing of raw data, estimating expression values, and detecting differentially expressed genes, as well as more advanced downstream analyses including classification, time series analysis, and detection of enriched functional gene sets.
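
To make the step of detecting differentially expressed genes more concrete, here is a minimal sketch, not the protocol of UNIT 19.6. It assumes a hypothetical dictionary that maps each gene to raw intensities with control samples listed first, and it ranks genes by a plain Welch t-statistic on log2 values. Practical analyses add normalization, moderated statistics, and multiple-testing correction, all of which are omitted here.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    return (mean(x) - mean(y)) / math.sqrt(
        variance(x) / len(x) + variance(y) / len(y)
    )


def rank_genes(expr, n_control):
    """Rank genes by |t|, most strongly changed first.

    expr: gene name -> list of positive intensities, controls first
          (a hypothetical input layout used only for this sketch).
    Returns a list of (gene, log2 fold change, t-statistic) tuples.
    """
    results = []
    for gene, values in expr.items():
        logs = [math.log2(v) for v in values]  # log-transform intensities
        ctrl, treat = logs[:n_control], logs[n_control:]
        log_fc = mean(treat) - mean(ctrl)
        results.append((gene, log_fc, welch_t(treat, ctrl)))
    return sorted(results, key=lambda r: abs(r[2]), reverse=True)


if __name__ == "__main__":
    # Made-up intensities: three control and three treated samples per gene.
    toy = {
        "geneA": [100, 110, 90, 400, 420, 380],
        "geneB": [200, 210, 190, 205, 195, 200],
    }
    for gene, log_fc, t in rank_genes(toy, n_control=3):
        print(gene, round(log_fc, 2), round(t, 2))
```

The toy values are invented for illustration. The sketch also assumes that each group has at least two samples and nonzero variance, since otherwise the t-statistic is undefined.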

Functional studies of small regulatory RNAs are a prominent, rapidly developing area of research, both biologically and computationally. UNIT 19.8 is a concise, high-level overview of the main categories of small RNAs (siRNAs, miRNAs, and piRNAs) and the corresponding bioinformatics resources for accessing and analyzing small RNA data, including online databases, tools, and portals.

The recent growth of computational genomics is arguably among the most explosive developments in the history of bioinformatics. The research landscape of this field is rapidly changing in terms of questions for quantitative analysis, data acquisition technologies, and computational methods. A few tools, databases, and platforms, however, are likely to endure as major, widely used resources, at least for the next few years. The UCSC Genome Browser, reviewed in UNIT 19.9, is a central online portal for accessing, viewing, and performing basic analysis of a large number of sequenced genomes and a variety of genomic data, including the plethora of ENCODE (Encyclopedia of DNA Elements) datasets. Using specific workflow examples, this unit introduces the reader to both basic Browser functionalities and recent developments (e.g., track hubs, super-tracks) available to the user. Galaxy, reviewed in UNIT 19.10, is a multi-functional informatics tool for the management and analysis of high-throughput data that has proven invaluable to bench scientists without advanced computational expertise. An important Galaxy feature is an easy interface that gives the scientist user-friendly access to many advanced analysis tools. This interface enables the user to run these tools, adjust their settings, and, if needed, incorporate new methods into Galaxy workflows.

Random transposon insertion followed by high-throughput screening of the resulting mutants has become a valuable tool for the comprehensive identification of functional bacterial genes. As an example of a computational application for sequence-based analysis of bacterial transposon insertion libraries in a high-throughput screen, UNIT 19.7 describes the software package PATIMDB, which tracks sample processing and identifies the genes interrupted by a transposon insertion in each library mutant.

In sum, the computational tools, databases, workflows, and other resources described in this chapter can be used by experimentalists working in a wide variety of research areas to analyze both internal and publicly available data, answer specific biological questions, generate hypotheses, and guide further experiments.

Ruslan I. Sadreyev
Guest Editor
Department of Molecular Biology, Massachusetts General Hospital
and Department of Pathology, Massachusetts General Hospital and Harvard Medical School
Boston, MA
