A collection of open source applications for mass spectrometry data mining.

Proteomics 2014, 14, 2275–2279


DOI 10.1002/pmic.201400124

TECHNICAL BRIEF

Óscar Gallardo, David Ovelleiro, Marina Gay, Montserrat Carrascal and Joaquin Abian

CSIC/UAB Proteomics Laboratory, Instituto de Investigaciones Biomédicas de Barcelona-Consejo Superior de Investigaciones Científicas, IDIBAPS, Barcelona, Spain

We present several bioinformatics applications for the identification and quantification of phosphoproteome components by MS. These applications include a front-end graphical user interface that combines several extractors from Thermo RAW formats to the MASCOT™ Generic Format (EasierMgf), two graphical user interfaces for the search engines OMSSA and SEQUEST (OmssaGui and SequestGui), and three further applications: one for the management of databases in FASTA format (FastaTools), another for the integration of search results from up to three search engines (Integrator), and a third for the visualization of mass spectra and their corresponding database search results (JsonVisor). These applications were developed to solve some of the common problems found in proteomic and phosphoproteomic data analysis and were integrated into the workflow that processes data and feeds our LymPHOS database. The applications were designed modularly and can be used standalone. The tools are written in the Perl and Python programming languages and are supported on Windows platforms. They are all released under an Open Source Software license and can be freely downloaded from our software repository hosted at GoogleCode.

Received: April 4, 2014 Revised: June 3, 2014 Accepted: July 21, 2014

Keywords: Bioinformatics / Data analysis / Data handling / Phosphoproteomics / Search engine

Additional supporting information may be found in the online version of this article at the publisher’s web-site

A major unsolved problem in proteomics still lies in the analysis of MS data, often stored in proprietary formats, and/or with nonstandard analytical requirements. This typically involves converting data format, carrying out specific calculations not implemented in commercial software, using different search engines, repeating analytical tasks not yet automated, and integrating data from various sources. In the course of our studies on the T-cell phosphoproteome [1], we found a series of problems not addressed by the available software, generally related to the lack of an Ascore-like calculation [2, 3] and phosphorylation reassignment capabilities, or to the handling of identification results from multiple proteomic search engines, as can be seen in Table 1. Therefore, we decided to develop a series of informatics tools to process MS data for the identification and quantification of proteomes, with special emphasis on phosphoproteomes. These tools include the EasierMgf, FastaTools, Integrator, and JsonVisor packages, and front ends such as SequestGui and OmssaGui. Although initially developed as tools integrated in the LymPHOS workflow [3], all of them can be used individually for analyzing general proteomics data. They are all Open Source programs under the GNU General Public License v3 [4], and can be downloaded from our GoogleCode repository (https://lp-csic-uab.googlecode.com/). Files from a practical demonstration of the full workflow using data from a real shotgun phosphoproteomics experiment are available in the Supporting Information section.

Correspondence: Dr. Joaquin Abian, CSIC/UAB Proteomics Laboratory, Instituto de Investigaciones Biomédicas de Barcelona-Consejo Superior de Investigaciones Científicas, IDIBAPS, Rosellón 161, 6a planta, 08036 Barcelona, Spain
E-mail: [email protected]
Fax: +34 93 581 49 13

Abbreviations: GUI, graphical user interface; HCD, higher-energy collisional dissociation; JSON, JavaScript object notation; MGF, MASCOT™ generic format; PQD, pulsed Q dissociation; TMT, tandem mass tags

Colour Online: See the article online to view Fig. 1 in colour.

© 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

www.proteomics-journal.com


Table 1. Comparison of some available software tools in relation to phosphoproteomics analysis

| | Trans-Proteomic Pipeline | OpenMS/TOPP | MASPECTRA 2 | MaxQuant | Proteome Discoverer v 1.4 |
|---|---|---|---|---|---|
| License | Open Source | Open Source | Free | Free | Commercial |
| Search engines supported | SEQUEST, Mascot, X!Tandem, OMSSA a) | SEQUEST b), Mascot, X!Tandem, InsPect, CompNovo, PepNovo, PILIS, SpecLib, OMSSA, other engines exporting to mzIdentML c) | Can import results from multiple search engines (including SEQUEST, OMSSA and Phenyx) | Andromeda | SEQUEST, Sequest HT, Mascot, MSPepSearch, SpectraST, and MS Amanda d) |
| Results integration | Yes | Yes | No e) | No | No |
| Ascore-like calculation | No f) | No g) | No | Yes | phosphoRS algorithm (v 3.0) h) |
| Phosphorylation reassignment | No f) | No g) | No | No | No |

a) OMSSA [5] support is still in beta (testing) stage.
b) For SEQUEST [6]: OpenMS/TOPP does not provide a user interface, but can import .out result files.
c) Such as PEAKS [7], Phenyx [8] and EasyProt [9].
d) MS Amanda is a freely available search engine, but only for high-resolution and high-accuracy tandem mass spectra [10].
e) To get integrated identification reports, the user can sieve the obtained results by means of the query and filter system of MASPECTRA 2.
f) As of July 2014. Recently, LuciPHOr [11], a tool with this functionality that is fully compatible with the Trans-Proteomic Pipeline, became available as source code.
g) As of May 2014. PhosphoScoring, an experimental tool with this functionality based on Ascore, is being developed as part of OpenMS.
h) The phosphoRS algorithm calculates the individual probability values for each putative phosphorylated site, as the Ascore calculation does; however, reassignments of p-sites must be carried out by the user.

Transformation of acquired raw data: EasierMgf is a Python graphical user interface (GUI) application that extracts mass spectrum parameters and data (MS level, mass-intensity array, precursor peptide mass, scan number, retention time) from multiple Thermo binary RAW files and generates plain-text MASCOT™ generic format (MGF) files. EasierMgf can also manipulate data to remove +1 charged peptides and split MS2 and MS3 scans into two different files. However, the program's most remarkable feature is that it can add fragmentation data from consecutive higher-energy collisional dissociation (HCD)/CID or pulsed Q dissociation (PQD)/CID MS2 scans to produce CID MS/MS spectra containing quantitative, low-mass information from HCD scans (iTRAQ or tandem mass tags (TMT) labels; Fig. 1) [12]. In this process, only quantitative data are added, thus maintaining the quality of the middle and high mass range of CID spectra. MS2 HCD and PQD data are also imported into their corresponding CID MS3 scan (when available), facilitating quantification of peptides identified only by MS3 scans. This application is unique in that it allows the quantitative analysis of MS3 identifications. Four different interfaces (front ends) to third-party tools (back ends) were implemented in EasierMgf: ReAdW and ReAdW4Mascot2 are interfaces to the corresponding tools developed as part of the Trans-Proteomic Pipeline and the NIST MSQC Pipeline [13], respectively, while extract_msn_classic and extract_msn_com are interfaces to different versions of the extract_msn tool from Thermo's Xcalibur Development Kit (extract_msn.exe, version 4, and extract_msn_com.exe, version 5, respectively). The modular design of these front ends makes the implementation of new interfaces for other third-party tools very straightforward for a Python developer.

Figure 1. Main EasierMgf window after extraction of mass spectral data from a Thermo RAW file. The ReAdW notebook tab is shown with the options used to call the ReAdW back end. Output messages from the back end can be seen in the log (lower right text box). MGF files were generated according to the processing options shown in the lower left area of the window. Addition of quantitative fragmentation data provided by the HCD scans to CID scans (in experiments using iTRAQ or TMT labels with consecutive HCD/CID scans) to produce combined HCD + CID MS/MS spectra is shown on the left.

Search databases arrangement: FastaTools. Protein databases are commonly provided as FASTA-formatted files that can be used by search engines either directly or after processing to generate combined target-decoy databases, which allow the false discovery rate, and thus the quality of peptide identifications, to be determined [14]. It is also often necessary to perform certain operations on the original FASTA files, such as analyzing their contents, performing queries, obtaining subdatabases, or joining several databases together. FastaTools is a small, easy-to-use, single-file Perl GUI program developed as a tool to aid in these operations.

Three search engines strategy: OmssaGui and SequestGui. To run batch searches using the free search engine OMSSA, we developed OmssaGui, a simple, user-friendly Perl graphical environment that facilitates the use of OMSSA for identifying tandem MS spectra from multiple MGF files. The user only has to indicate a directory containing the MGF files and the location of the BLAST-indexed FASTA protein database, as well as some parameters, such as the enzyme used to virtually digest the database, mass tolerances, and allowed modifications. In comparison with other interfaces available for OMSSA [15, 16], our OmssaGui includes a dedicated parser that directly converts the OMSSA output into a single spreadsheet-like file that is compatible with both Microsoft® Excel® and Integrator.

SequestGui is a Perl graphical user interface that facilitates the use of the SEQUEST search engine for identifying MS/MS spectra from multiple MGF files in batch. Although new software, such as Proteome Discoverer, allows the use of MGF files, previous versions only use the Thermo RAW file format for searching. Search parameters for SEQUEST are loaded through a user-supplied BioWorks PARAMS file. Finally, the multiple .out search result files generated by SEQUEST are summarized and reported, as in OmssaGui, in a single spreadsheet-like file (with XLS extension) that is easy to inspect and also compatible with Integrator. In both OmssaGui and SequestGui outputs, amino acid modification annotation is homogenized using the Unimod codes [17].

Search data integration and filtering: Integrator is a GUI-based application, developed in Perl, for the combination of search results obtained from the same MGF files using different search engines. As we described previously [18], parallel analysis of fragmentation spectra using three search engines increases the number and confidence of identified sequences. When each search engine uses different algorithms and score functions, identification of the same sequence by more than one engine greatly increases confidence in the match [14]. Using this approach, individual search results are produced and exported with minimal filtering.

Integrator uses diverse parser modules to load the output data files. These parsers ensure that common data elements (charge, mass, peptide sequence, etc.) are loaded under common key names regardless of their original denomination. This enables the integration processes and facilitates the implementation of new parser modules for other search engines. Currently, Integrator provides parsers for five search engines: SEQUEST, OMSSA, Phenyx, EasyProt, and PEAKS. For OMSSA and SEQUEST, search result files must be in the Microsoft® Excel®-compatible format generated by OmssaGui and SequestGui; result files exported by Phenyx or EasyProt must be in Microsoft® Excel® XLS or XLSX format, respectively. In the case of PEAKS, the comma-separated values result files for each MGF input file must be exported to a folder named after the original MGF file, and these folders must be compressed into a zip file that is then read by the Integrator parser. This process is required for Integrator to recover the spectrum file name (from the folder name), which is not included in the PEAKS comma-separated values result files.

Data from each search engine are introduced into the Integrator application without previous FDR filtering and including the decoy matches. Filtering of the correct assignments is done by the Integrator software, which considers only matches that are identified by two or more search engines. With this approach, the FDR in our data sets is about 0.7% (input data generated with cutoffs of Xcorr > 2, e-value < 1, and z-score > 5 for the SEQUEST, OMSSA, and Phenyx engines, respectively).
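The integration idea described above (loading engine-specific results under common key names, keeping only assignments reported by at least two engines, and estimating the FDR from the surviving decoy matches) can be illustrated with a minimal Python sketch. This is not Integrator's code, which is written in Perl. The column names in KEY_MAPS, the decoy_ accession prefix, and the FDR formula (decoy hits divided by target hits) are all assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical engine-specific column names; real exports will differ.
KEY_MAPS = {
    "sequest": {"scan": "Scan", "sequence": "Peptide", "charge": "z", "protein": "Reference"},
    "omssa": {"scan": "Spectrum number", "sequence": "Peptide", "charge": "Charge", "protein": "Defline"},
    "phenyx": {"scan": "scan", "sequence": "sequence", "charge": "charge", "protein": "ac"},
}


def normalize(engine, row):
    """Map one engine-specific result row onto common key names."""
    keys = KEY_MAPS[engine]
    return {
        "engine": engine,
        "scan": int(row[keys["scan"]]),
        "sequence": row[keys["sequence"]],
        "charge": int(row[keys["charge"]]),
        "protein": row[keys["protein"]],
    }


def consensus(psms, min_engines=2):
    """Keep (scan, sequence, charge) assignments reported by >= min_engines engines."""
    engines_by_match = defaultdict(set)
    protein_by_match = {}
    for psm in psms:
        key = (psm["scan"], psm["sequence"], psm["charge"])
        engines_by_match[key].add(psm["engine"])
        protein_by_match.setdefault(key, psm["protein"])
    return [
        {"scan": key[0], "sequence": key[1], "charge": key[2],
         "protein": protein_by_match[key], "engines": sorted(engines)}
        for key, engines in engines_by_match.items()
        if len(engines) >= min_engines
    ]


def estimate_fdr(matches, decoy_prefix="decoy_"):
    """Estimate FDR as decoy hits / target hits (an assumed, commonly used formula)."""
    decoys = sum(m["protein"].startswith(decoy_prefix) for m in matches)
    targets = len(matches) - decoys
    return decoys / targets if targets else 0.0


if __name__ == "__main__":
    rows = [
        ("sequest", {"Scan": "101", "Peptide": "AAGSPEK", "z": "2", "Reference": "P12345"}),
        ("omssa", {"Spectrum number": "101", "Peptide": "AAGSPEK", "Charge": "2", "Defline": "P12345"}),
        ("phenyx", {"scan": "102", "sequence": "LLDK", "charge": "2", "ac": "Q99999"}),
    ]
    accepted = consensus(normalize(engine, row) for engine, row in rows)
    print(accepted, estimate_fdr(accepted))
```

Grouping on scan, sequence, and charge is a deliberately strict choice; a real integration step would also have to reconcile modification annotations and spectrum file names before comparing matches.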


Figure 2. Integrator input and output files. Input files include the output files from the corresponding search engines, and the MGF files obtained with EasierMgf. The FASTA database was the plain-text FASTA file generated with FastaTools containing target + decoy human protein sequences. All these files, as well as the corresponding JSON output file and the Report tab-separated values file generated by Integrator, are provided as Supporting Information.
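A combined target + decoy FASTA database of the kind used as Integrator input can be produced in several ways. The Python sketch below appends a reversed-sequence decoy after each target entry. It is only an illustration and does not reproduce FastaTools, which is a Perl program; the decoy_ header prefix, the sequence-reversal strategy, and the function names are assumptions.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def target_decoy(records, decoy_prefix="decoy_"):
    """Yield each target record followed by its reversed-sequence decoy."""
    for header, sequence in records:
        yield header, sequence
        yield decoy_prefix + header, sequence[::-1]


def format_fasta(records, width=60):
    """Serialize (header, sequence) pairs as FASTA text with wrapped sequence lines."""
    out = []
    for header, sequence in records:
        out.append(">" + header)
        out.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    example = [">P1 example protein", "MKTAYIAK", "QR", ">P2", "GGSPLK"]
    print(format_fasta(target_decoy(read_fasta(example))), end="")
```

Because the decoys carry a recognizable header prefix, matches against them can later be counted to estimate the false discovery rate.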

Different processes are carried out by Integrator modules during integration:

(i) Selection of those peptide hits found by at least two search engines.

(ii) Reaching consensus among information regarding parameters such as peptide sequence, Unimod annotation for amino acid modifications, MS stage, and ion charge.

(iii) Addition of spectral data collected from the original MGF files, such as scan number, parent ion m/z, and m/z intensity array.

(iv) Q-Ascore calculation and phosphorylation reassignment [2, 3] using the spectral data contained in the MGF files for each identified phosphorylated peptide. This functionality is compatible with iTRAQ or TMT ion tags for quantification, and works with both MS2 and MS3 data.

(v) Search of all matching proteins for each peptide in a provided, nonindexed FASTA file.

(vi) If needed, the intensity of TMT or iTRAQ reporter ions is extracted from spectra in MGF files and reported for each peptide, including those identified in an MS3 spectrum.

When the data integration processes are finished, Integrator outputs the results as a standard plain-text JavaScript object notation (JSON) file [19] with a .DB extension (Fig. 2). Due to the complexity of these .DB JSON files, generic JSON viewers or text editors are usually not adequate. Consequently, we implemented two different approaches for convenient data inspection: Integrator reports and JsonVisor. Integrator reports are spreadsheet-like files with XLS extension generated through the Report panel of Integrator by loading a previously generated .DB file (Fig. 2). JsonVisor is a graphical Python application developed to directly visualize the contents of .DB JSON files. The software allows navigating through a list of identified spectra, showing complete spectrometric and peptide identification information, and including a graphical view of each spectrum with assignment of its ion fragments.

We have developed several bioinformatics applications directed to the identification and quantification of phosphoproteomes using MS. These tools solve some of the limitations we found in the available software, especially for the assessment of phosphorylation sites by means of Q-Ascore probabilistic calculations and automatic reassignment, and the possibility of using the HCD/PQD data from MS2 to quantify MS3 fragmentation spectra. The Open Source nature and modularity of the applications allow custom modifications of the different software tools to adapt them to any new requirements.

This work was supported by grants BIO2009-11735 and BIO2013-46492 from the Spanish Ministerio de Ciencia e Innovación. The CSIC/UAB Proteomics Facility of IIBB-CSIC belongs to ProteoRed, PRB2-ISCIII, supported by grant PT13/0001.

The authors have declared no conflict of interest.

References

[1] Carrascal, M., Ovelleiro, D., Casas, V., Gay, M., Abian, J., Phosphorylation analysis of primary human T lymphocytes using sequential IMAC and titanium oxide enrichment. J. Proteome Res. 2008, 7, 5167–5176.

[2] Beausoleil, S. A., Villén, J., Gerber, S. A., Rush, J., Gygi, S. P., A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat. Biotechnol. 2006, 24, 1285–1292.

[3] Ovelleiro, D., Carrascal, M., Casas, V., Abian, J., LymPHOS: design of a phosphosite database of primary human T cells. Proteomics 2009, 9, 3741–3751.

[4] The GNU General Public License v3.0, GNU Project, Free Software Foundation (FSF), n.d.

[5] Geer, L. Y., Markey, S. P., Kowalak, J. A., Wagner, L. et al., Open mass spectrometry search algorithm. J. Proteome Res. 2004, 3, 958–964.

[6] Eng, J. K., McCormack, A. L., Yates, J. R., III, An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5, 976–989.

[7] Ma, B., Zhang, K., Hendrie, C., Liang, C. et al., PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 2003, 17, 2337–2342.

[8] Colinge, J., Masselot, A., Giron, M., Dessingy, T., Magnin, J., OLAV: towards high-throughput tandem mass spectrometry data identification. Proteomics 2003, 3, 1454–1463.

[9] Gluck, F., Hoogland, C., Antinori, P., Robin, X. et al., EasyProt: an easy-to-use graphical platform for proteomics data analysis. J. Proteomics 2013, 79, 146–160.

[10] Dorfer, V., Pichler, P., Winkler, S., Mechtler, K., MS Amanda: a new scoring system for high resolution MS/MS spectra. Austrian Proteomic Res. Symp. 2012.

[11] Fermin, D., Walmsley, S. J., Gingras, A.-C., Choi, H., Nesvizhskii, A. I., LuciPHOr: algorithm for phosphorylation site localization with false localization rate estimation using modified target-decoy approach. Mol. Cell. Proteomics 2013, 12, 3409–3419.

[12] Dayon, L., Pasquarello, C., Hoogland, C., Sanchez, J.-C., Scherl, A., Combining low- and high-energy tandem mass spectra for optimized peptide quantification with isobaric tags. J. Proteomics 2010, 73, 769–777.

[13] Rudnick, P. A., Clauser, K. R., Kilpatrick, L. E., Tchekhovskoi, D. V. et al., Performance metrics for liquid chromatography-tandem mass spectrometry systems in proteomics analyses. Mol. Cell. Proteomics 2010, 9, 225–241.

[14] Jones, A. R., Siepen, J. A., Hubbard, S. J., Paton, N. W., Improving sensitivity in proteome studies by analysis of false discovery rates for multiple search engines. Proteomics 2009, 9, 1220–1229.

[15] Tharakan, R., Martens, L., Van Eyk, J. E., Graham, D. R., OMSSAGUI: an open-source user interface component to configure and run the OMSSA search engine. Proteomics 2008, 8, 2376–2378.

[16] Vaudel, M., Barsnes, H., Berven, F. S., Sickmann, A., Martens, L., SearchGUI: an open-source graphical user interface for simultaneous OMSSA and X!Tandem searches. Proteomics 2011, 11, 996–999.

[17] Creasy, D. M., Cottrell, J. S., Unimod: protein modifications for mass spectrometry. Proteomics 2004, 4, 1534–1536.

[18] Carrascal, M., Gay, M., Ovelleiro, D., Casas, V. et al., Characterization of the human plasma phosphoproteome using linear ion trap mass spectrometry and multiple search engines. J. Proteome Res. 2009, 9, 876–884.

[19] Crockford, D., RFC 4627: the application/json Media Type for JavaScript Object Notation (JSON) 2006.


