Recent advances in chemometric methods for plant metabolomics: A review
Lunzhao Yi, Naiping Dong, Yonghuan Yun, Baichuan Deng, Shao Liu, Yi Zhang, Yizeng Liang
PII: S0734-9750(14)00183-9
DOI: 10.1016/j.biotechadv.2014.11.008
Reference: JBA 6866
To appear in: Biotechnology Advances
Please cite this article as: Yi Lunzhao, Dong Naiping, Yun Yonghuan, Deng Baichuan, Liu Shao, Zhang Yi, Liang Yizeng, Recent advances in chemometric methods for plant metabolomics: A review, Biotechnology Advances (2014), doi: 10.1016/j.biotechadv.2014.11.008
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Recent advances in chemometric methods for plant metabolomics: a review
Lunzhao Yi a,*, Naiping Dong c, Yonghuan Yun b, Baichuan Deng e, Shao Liu d, Yi Zhang b, Yizeng Liang b
a Yunnan Food Safety Research Institute, Kunming University of Science and Technology, Kunming, 650500, China
b College of Chemistry and Chemical Engineering, Central South University, Changsha, 410083, China
c Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Hong Kong, 999077, China
d Xiangya Hospital, Central South University, Changsha, 410008, China
e Department of Chemistry, University of Bergen, Bergen, N-5007, Norway
*Correspondence to: Lunzhao Yi, Yunnan Food Safety Research Institute, Kunming University of Science and Technology, Kunming, 650500, China. Tel.: +86 871 65920302. E-mail address: [email protected].
Abstract
This review focuses on recent and potential advances in chemometric methods for data processing in plant metabolomics, especially for data generated by mass spectrometry (MS) techniques. Plant metabolomics has gradually come to be regarded as a valuable and promising biotechnology rather than an ambitious advancement. We outline some significant developments in plant metabolomics, especially in the combination of modern chemical analysis techniques with dedicated statistical and chemometric data analysis strategies. Advanced techniques for the preprocessing of raw data, identification of metabolites, variable selection and modeling are illustrated. We believe that insights into these developments help to narrow the knowledge gap between the molecular organization and the metabolic control of plants. We also discuss the limitations of, and perspectives on, extracting information from high-throughput datasets.
Keywords: plant metabolomics; chemometrics; biomarker; identification of metabolites; data preprocessing; modeling
Contents
1. Introduction
2. Critique and discussion
2.1 Pre-processing of raw data
2.1.1 Noise filtering and baseline correction
2.1.2 Peak detection and deconvolution
2.1.3 Alignment
2.1.4 Normalization
2.2 Identification of metabolites
2.2.1 Standards for reporting metabolite identification results
2.2.2 Metabolite identification using GC-MS
2.2.3 Metabolite identification using LC-MS
2.3 Variable selection
2.3.1 Variable ranking
2.3.2 Variable subset selection
2.3.3 Variable selection considering the interaction effect among variables
2.4 Modeling of the data
2.4.1 Unsupervised methods
2.4.2 Supervised methods
2.4.3 Non-linear methods
2.4.4 Model tuning and model validation
2.5 One eye on the future
3. Conclusions
1. Introduction
Metabolomics refers to the comprehensive and quantitative analysis of metabolites and aims to gather as much metabolic information as possible from a biological system (Goodacre et al., 2004). It is attractive as a reproducible and efficient approach that directly reflects biological events. Taking advantage of this approach, a number of applications to cell, human and plant systems have already been published or predicted (Bertrand et al., 2014, Deborde et al., 2011, Rasmussen et al., 2012, Toya and Shimizu, 2013). Recently, plant metabolomics has rapidly progressed from a promising concept to a widespread and valuable biotechnology (Cusido et al., 2014, Davey et al., 2005, Hall, 2006). The information gained from plant metabolomics reflects a biological endpoint in much more detail than that obtained from transcriptomics or proteomics. So far, it has been applied to the quality control of crop plants (Biais et al., 2012, Osorio et al., 2012), plant ecology (van Dam and Meijden, 2011), the study of stress biology in plants (Genga et al., 2011), natural product discovery (Kell, 2006), and other areas. The number of advances in plant metabolomics has increased exponentially in recent years (see Figure 1). At the same time, two modern analytical platforms, namely nuclear magnetic resonance (NMR) and mass spectrometry (MS), have become the methods of choice for metabolic analysis and have been generating massive amounts of data to answer biological questions in plant metabolomics (Allwood and Goodacre, 2010, Allwood et al., 2012, Kim et al., 2011, Kueger et al., 2012).
Insert Figure 1
The raw data from metabolomics are the ultimate source of information, which is in turn converted into knowledge (Goodacre, 2005). Turning messy metabolic information into valuable knowledge requires considerable data analysis, including data preprocessing, statistical analysis, and suitable data storage (Allwood et al., 2008, Goodacre, Vaidyanathan, 2004). Improvements in analytical technologies have made metabolomics datasets progressively larger and more intricate in their inner structure (Boccard and Rudaz, 2014). This makes the coverage of metabolomics more comprehensive but, as a result, places increasing demands on chemometrics (van der Greef and Smilde, 2005). The bottlenecks of plant metabolomics lie not only in sample preparation and analytical platforms but also, even more importantly, in data analysis. Major changes in the dimensionality and complexity of datasets lead to a significant shift in knowledge discovery. To exploit metabolomics data to the fullest extent, chemometrics has become a crucial and dedicated tool for extracting valuable information from messy data (Boccard and Rudaz, 2014, Wolfender et al., 2013). Chemometrics offers theory and methodology for every step of metabolomics research, including sampling, experimental design, data pre-processing, metabolite identification, variable selection and modeling. It therefore matches the requirements of metabolomics research well and is one of the cornerstones of plant metabolomics. On the other hand, the complexity of metabolomics also poses major challenges for chemometrics in dealing with such massive, high-dimensional data (van der Greef and Smilde, 2005).
Numerous review papers and guide books on plant metabolomics have been published (BaniMustafa and Hardy, 2012, Hall, 2011, Kim and Verpoorte, 2010, Villas-Bôas et al., 2005), providing informative and valuable guidance for researchers. Insights into experimental techniques in metabolomics, including sample preparation and metabolite analysis, have also been published this year (Ernst et al., 2014). Here, we attend to recent advances in chemometric methods for the data analysis of plant metabolomics. This review gives a brief but broad overview of the methods developed, the challenges remaining in the analysis of plant metabolomics data, specifically data generated by MS, and perspectives on this topic. Various aspects are discussed, including raw data pre-processing, metabolite identification, variable selection and modeling. The flowchart of data analysis in plant metabolomics is shown in Figure 2.
Insert Figure 2
2. Critique and discussion
2.1 Pre-processing of raw data
Analytical instruments do not provide clean and comparable lists of metabolites, so raw data must be processed in several ways to generate a usable data matrix, including noise filtering and baseline correction, peak detection and deconvolution, alignment, and normalization (Castillo et al., 2011). Data preprocessing is a very important step in the analysis of metabolomics data. The key goal is to eliminate unwanted variance and bias in order to reduce complexity and enhance metabolically significant signals (Smith et al., 2006). The development of algorithms and tools for data preprocessing is a critical issue in bioinformatics and chemometrics research. As a result, many algorithms have been developed, and multiple open-source programs are used to process raw MS data acquired by liquid chromatography–mass spectrometry (LC-MS) or gas chromatography–mass spectrometry (GC-MS). Among these tools, XCMS (https://xcmsonline.scripps.edu/) (Benton et al., 2008, Smith, Want, 2006), MZmine (http://sourceforge.net/projects/mzmine/) (Katajamaa et al., 2006, Pluskal et al., 2010), OpenMS (http://open-ms.sourceforge.net/) (Sturm et al., 2008) and MetAlign (http://www.metalign.nl) (De Vos et al., 2007, Keurentjes et al., 2006, Moco et al., 2006, Tikunov et al., 2005) have attracted particular attention owing to their practicability and effectiveness, and most of the metabolomics research community works with them. In addition, new programs have been steadily developed to increase the quality and efficiency of data preprocessing, such as MetSign (Wei et al., 2011), MSFACTs (Duran et al., 2003), TagFinder (Luedemann et al., 2008, Luedemann et al., 2012), MET-IDEA (Lei et al., 2012), MathDAMP (Baran et al., 2006), and MetaboliteDetector (Hiller et al., 2009). Most of these tools, as well as others, are open source and can be downloaded freely from the internet, which also makes it convenient to exchange algorithms and data within the community. Generally, tools for raw data preprocessing contain four basic modules, namely noise filtering and baseline correction, peak detection and deconvolution, alignment, and normalization. We introduce different chemometric algorithms and strategies for these modules in the following subsections.
2.1.1 Noise filtering and baseline correction
Noise filtering is designed to separate the signals of components from background originating from the chemical matrix or instrumental interference, and to remove measurement noise and baseline errors (Katajamaa and Orešič, 2007). Conventionally, in the baseline correction of one-way data (in the chromatographic or mass direction), the analyst manually marks the two ends of a signal peak, and a piecewise linear approximation is then fitted as the baseline (Zhang et al., 2010). However, this process is manual and time-consuming, and its accuracy depends heavily on the user's skill (Jirasek et al., 2004). To solve this problem, a large number of algorithms have been developed for better estimation of the baseline. In 1977, Pearson proposed a classic baseline estimation algorithm (Pearson, 1977). It works iteratively and inspects the points in a specific interval, taking their standard deviation into account. For this method, the selection of parameters is extremely important; even a slight mistake can lead to unacceptably large deviations. To overcome this weakness, numerous modifications have been developed to make baseline correction faster, more robust and automatic. Examples include improved iterative polynomial fitting (Gan et al., 2006), wavelet transform for de-noising (Shao et al., 2003), a low-order polynomial smoothing filter based on the Savitzky-Golay algorithm (Wang et al., 2003), iterative asymmetric least-squares estimation (Eilers, 2004), and the elimination of background spectrum (EBS) method (Boelens et al., 2004). Most recently, two powerful algorithms, selective iteratively reweighted quantile regression (SirQR) (Liu et al., 2014a) and adaptive iteratively reweighted penalized least squares (airPLS) (Zhang, Chen, 2010), were developed by Liang's group. These algorithms can remove the baseline automatically and effectively, regardless of whether it is linear or non-linear. Furthermore, they require neither user intervention nor prior knowledge, such as peak detection, and they are fast and robust.
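The idea behind iteratively reweighted baseline estimation can be illustrated with a minimal sketch. It is not the published SirQR or airPLS code, and it makes a simplifying assumption: the usual formulation of asymmetric least squares penalizes second-order differences, whereas a first-order penalty is used here so that the linear system is tridiagonal and can be solved in plain Python. The function names and the parameters lam (smoothness), p (asymmetry) and n_iter are illustrative choices.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    In row i, lower[i] multiplies x[i-1] and upper[i] multiplies x[i+1].
    """
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def asymmetric_baseline(y, lam=100.0, p=0.01, n_iter=10):
    """Estimate a baseline by asymmetric, iteratively reweighted smoothing.

    Points above the current baseline (likely peaks) get the small weight p,
    points below it get the weight 1 - p. Assumes len(y) >= 2.
    """
    n = len(y)
    weights = [1.0] * n
    lower = [0.0] + [-lam] * (n - 1)
    upper = [-lam] * (n - 1) + [0.0]
    # Diagonal of the first-difference penalty: 1 at the ends, 2 inside.
    penalty_diag = [1.0] + [2.0] * (n - 2) + [1.0]
    baseline = list(y)
    for _ in range(n_iter):
        diag = [w + lam * k for w, k in zip(weights, penalty_diag)]
        rhs = [w * v for w, v in zip(weights, y)]
        baseline = solve_tridiagonal(lower, diag, upper, rhs)
        weights = [p if v > b else 1.0 - p for v, b in zip(y, baseline)]
    return baseline


# Synthetic trace: flat background of 5 with a narrow peak of height 10.
signal = [5.0] * 50
signal[24:27] = [15.0, 15.0, 15.0]
baseline = asymmetric_baseline(signal)
corrected = [v - b for v, b in zip(signal, baseline)]
```

Because peak points are down-weighted after the first pass, the estimated baseline stays close to the background under the peak, so subtracting it preserves most of the peak height.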
For MS-based datasets, random noise is typically removed using traditional signal processing techniques from chemometrics, such as a moving average window (Radulovic et al., 2004) or a median filter (Hastings et al., 2002) in the chromatographic direction, and Savitzky-Golay local polynomial fitting (Wang, Zhou, 2003) or wavelet transformation (Li et al., 2005) in the m/z direction. Noise filtering of LC-MS data is more complicated than that of GC-MS data because both chemical noise and random noise are present. Chemical noise is typically induced by molecules in buffers and solvents and can be especially strong at the beginning and the end of the elution (Hilario et al., 2006), while random noise is mainly attributed to the detector. Chemical noise causes a baseline shift in the intermediate mass range of LC-MS spectra. To resolve this problem, many filtering methods have been proposed. For example, Haimi et al. fitted the baseline by first segmenting a spectrum and then performing a linear regression through the lowest points of the smoothed spectrum segments (Haimi et al., 2006). Baseline removal has also been approached by estimating the background from a two-dimensional intensity image and then removing it with two orthogonal (retention time and m/z) one-dimensional passes (Bellew et al., 2006).
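The two simplest filters mentioned above, the moving average window and the median filter, can be sketched as follows. This is a generic illustration rather than the implementation used in the cited studies; the half_width parameter and the handling of the edges (shrinking windows) are assumptions made for the example.

```python
from statistics import median


def moving_average(signal, half_width=2):
    """Smooth a 1-D trace with a centred moving-average window."""
    n = len(signal)
    smoothed = []
    for i in range(n):
        window = signal[max(0, i - half_width):min(n, i + half_width + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def median_filter(signal, half_width=2):
    """Replace each point by the median of its neighbourhood."""
    n = len(signal)
    return [
        median(signal[max(0, i - half_width):min(n, i + half_width + 1)])
        for i in range(n)
    ]


# A flat trace with a single one-point spike.
trace = [1.0, 1.0, 1.0, 50.0, 1.0, 1.0, 1.0]
averaged = moving_average(trace, half_width=1)
despiked = median_filter(trace, half_width=1)
```

On this trace the median filter removes the isolated spike completely, whereas the moving average only spreads it over neighbouring points, which is why median filters are often preferred for spike-like noise.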
2.1.2 Peak detection and deconvolution
The purpose of peak detection and deconvolution is to identify and quantify the signals corresponding to the molecules (e.g. the metabolites) in a sample (Castillo, Gopalacharyulu, 2011). It is the fundamental step for downstream data analysis, such as profile alignment and biomarker identification, and it reduces the complexity of the data (Katajamaa and Orešič, 2007). However, owing to the complexity of the signals and the multiple sources of noise in the data, automatically distinguishing noise from compound signals is very difficult. The threshold between noise and signal is hard to identify, especially when detecting peaks with low response values. A good peak detection method should identify the true signals correctly and avoid false positives. A major problem is that high response values do not always guarantee real peaks, because some sources of noise can also produce high signals. Conversely, low peaks may correspond to real signals. Therefore, constraints on peak shape and criteria of minimal intensity, area or signal-to-noise ratio are widely applied to distinguish real peaks from noise. Several parameters generally need to be adjusted to match the characteristics of the MS-based data. Traditionally, peak detection algorithms have followed one of two strategies, based either on derivative techniques or on the matched filter response (Felinger, 1998). Derivative-based peak detection methods make use of the fact that the first derivative of a peak has a positive-to-negative zero-crossing at the peak's local maximum (Vivó-Truyols et al., 2005). Derivative-based methods commonly require increasingly elaborate pre-processing to prevent noise effects from compounding (Krishnan et al., 2012, Pierce and Mohler, 2012). A threshold on the slope is often imposed to avoid false positives.
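The derivative-based strategy can be sketched in a few lines. The example below uses first differences as a stand-in for the derivative and accepts a zero-crossing only if the rising flank leading to it was steep enough. The function name, the min_slope parameter and this particular way of applying the slope threshold are illustrative assumptions, not a reproduction of any cited algorithm.

```python
def detect_peaks(signal, min_slope=1.0):
    """Return indices where the first difference changes from + to -.

    A candidate is kept only if the steepest rise on the flank leading
    up to it reaches min_slope, which suppresses small noise bumps.
    """
    peaks = []
    steepest_rise = 0.0
    for i in range(1, len(signal) - 1):
        before = signal[i] - signal[i - 1]
        after = signal[i + 1] - signal[i]
        if before > 0:
            steepest_rise = max(steepest_rise, before)
        else:
            steepest_rise = 0.0  # the rising flank has ended
        if before > 0 and after <= 0 and steepest_rise >= min_slope:
            peaks.append(i)
    return peaks


# Two real peaks (apexes at indices 6 and 14) and two small noise bumps.
trace = [0.0, 0.2, 0.0, 0.0, 1.0, 4.0, 9.0, 4.0, 1.0,
         0.0, 0.3, 0.0, 0.0, 2.0, 6.0, 2.0, 0.0]
strict = detect_peaks(trace, min_slope=1.0)
permissive = detect_peaks(trace, min_slope=0.0)
```

With the slope threshold disabled, the small bumps at indices 1 and 10 are also reported, which illustrates why such a threshold is usually imposed.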
Matched filtering applies a linear filter designed to detect the presence of a particular pulse event with a known structure embedded in additive noise (North, 1963). When applied to chromatographic data, and assuming a Gaussian peak shape, a threshold on the filter response can be used to locate chromatographic peaks (Danielsson et al., 2002). Matched filter methods are becoming progressively more sophisticated as data complexity increases. Several popular or open-source software packages are based on this approach, such as MapQuant (Leptos et al., 2006) and XCMS (Smith, Want, 2006). XCMS involves three steps: binning, signal determination and filtering. One weakness of the method initially proposed in XCMS is that a peak can sometimes be split between two adjacent m/z bins. One potential solution is to combine adjacent extracted ion chromatograms that represent the analytes of interest, but this algorithm cannot resolve pairs of co-eluting peaks that fall within half an m/z bin. The developers of XCMS therefore added another algorithm, called centWave, to solve this problem (Tautenhahn et al., 2008). The centWave algorithm uses the continuous wavelet transform (CWT) to detect chromatographic peaks with different widths and intensities. Every peak is checked against the maximum value of the centroid peak within the estimated peak boundaries. In addition, CWT has been applied to build a robust pattern matching method for MS peak detection. The CWT is applied directly to the raw spectrum, and the information in the two-dimensional CWT coefficient matrix is used. By identifying peaks and assigning signal-to-noise ratios in the wavelet space, the pattern matching problem is simplified. The issues surrounding baseline correction are removed at the same time, so preprocessing steps such as noise filtering and baseline correction are not required before peak detection (Du et al., 2006).
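As a minimal illustration of the matched filter strategy with an assumed Gaussian peak shape, the sketch below correlates a trace with a unit-energy Gaussian template and reports local maxima of the response that exceed a threshold. It is not the XCMS or MapQuant implementation; sigma, half_width and the threshold value are hypothetical parameters chosen for the example.

```python
import math


def gaussian_kernel(sigma, half_width):
    """Gaussian template scaled to unit energy."""
    kernel = [math.exp(-0.5 * (k / sigma) ** 2)
              for k in range(-half_width, half_width + 1)]
    norm = math.sqrt(sum(v * v for v in kernel))
    return [v / norm for v in kernel]


def matched_filter(signal, sigma=2.0, half_width=6):
    """Correlate the trace with the Gaussian template (zero padding)."""
    kernel = gaussian_kernel(sigma, half_width)
    n = len(signal)
    response = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half_width
            if 0 <= j < n:
                acc += weight * signal[j]
        response.append(acc)
    return response


def peaks_above(response, threshold):
    """Local maxima of the filter response that reach the threshold."""
    return [
        i for i in range(1, len(response) - 1)
        if response[i] >= threshold
        and response[i] > response[i - 1]
        and response[i] >= response[i + 1]
    ]


# Synthetic chromatographic peak: Gaussian of height 10 centred at index 20.
trace = [10.0 * math.exp(-0.5 * ((i - 20) / 2.0) ** 2) for i in range(40)]
response = matched_filter(trace)
found = peaks_above(response, threshold=5.0)
```

The response is largest where the template and the peak coincide, so the detected position matches the true apex when the assumed width is close to the real peak width.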
Selecting an optimal threshold for the two strategies mentioned above is an essential and difficult problem that has been thoroughly discussed for various peak detection approaches (Hastings, Norton, 2002, Leptos, Sarracino, 2006, Vivó-Truyols, Torres-Lapasió, 2005), but no general consensus has been reached. Recently, some algorithms based on Bayesian inference have been developed (Lopatka et al., 2014, Vivó-Truyols, 2012). These algorithms use chromatographic information (i.e. the expected width of a single peak and the standard deviation of the baseline noise) as prior information. The probability that a signal is a peak is then estimated on the basis of theories or hypotheses such as statistical overlap theory (Lopatka, Vivó-Truyols, 2014).
In the high-throughput analysis of metabolites, overlapping peaks are unavoidable. This kind of problem can be resolved by mass spectral deconvolution or two-dimensional data resolution methods, which have been well developed by the chemometrics community using matrix computation combined with the characteristics of MS data (Hantao et al., 2012, Liang and Kvalheim, 2001, Ruckebusch and Blanchet, 2013). So far, many multivariate resolution methods, such as heuristic evolving latent projections (HELP) (Kvalheim and Liang, 1992, Liang et al., 1992), evolving factor analysis (EFA) (Maeder, 1987), subwindow factor analysis (Manne et al., 1999), alternative moving window factor analysis (AMWFA) (Zeng et al., 2006) and evolving window orthogonal projection (EWOP) (Xu et al., 1999), have been employed in different application fields. These methods are able to resolve overlapped mixture peaks, even embedded chromatographic peaks, into the pure chromatograms and mass spectra of the components in the mixture. With the help of these deconvolution methods, the coverage of metabolites in a single run can be enlarged on existing analytical instruments, and the identification accuracy of metabolites is improved. For example, AMWFA has been successfully applied to resolve the overlapped peaks in the GC-MS analysis of Pericarpium Citri Reticulatae Viride (PCRV) and Pericarpium Citri Reticulatae (PCR) (Wang et al., 2008), as shown in Figure 3. More volatile components were identified in this study with the help of AMWFA, and the identification accuracy for the overlapped peaks was significantly increased. In addition to these resolution methods, the Automated Mass spectral Deconvolution and Identification System (AMDIS, NIST) and commercially available tools such as Deconvolution Reporting Software (DRS, Agilent), AnalyzerPro (SpectralWorks) and ChromaTOF® (LECO) can also be used for this purpose.
Insert Figure 3
2.1.3 Alignment
Shifts in retention time or m/z are inevitable in the experimental analysis of metabolites. Experimental factors, including temperature, column condition, pH, and sample carryover or degradation, can lead to deviations that affect the overall signal. Aligning the detected features across samples aims to remove these shifts for a given signal, so that useful information can be extracted by chemometric or informatics methods in the subsequent steps. Peak shifts have a strong impact on multivariate statistical analysis, such as principal component analysis (PCA) and partial least squares-discriminant analysis (PLS-DA). Inappropriate peak alignment can produce entirely spurious results in classification and biomarker screening. So far, several alignment techniques have been developed to minimize run-to-run shifts (Bylund et al., 2002, Prince and Marcotte, 2006, Tomasi et al., 2004).
Chromatographic systems coupled with sophisticated detection instruments, e.g. LC-MS, yield large amounts of two-dimensional data in the analysis of metabolites. To process these data with traditional peak alignment methods, the dimensionality needs to be reduced, which can be achieved by generating integrated peak areas or total ion chromatograms (TICs). For one-dimensional data (such as TICs), time alignment procedures can be used to tackle retention time shifts (Johnson et al., 2003), such as correlation optimized warping (COW) (Nielsen et al., 1998), dynamic time warping (DTW) (Pravdova et al., 2002) and recursive alignment by fast Fourier transform (RAFFT) (Wong et al., 2005). The original COW requires long execution times and large amounts of memory when dealing with huge hyphenated datasets. Artifacts often appear in fingerprints aligned by DTW because it tends to over-warp signals recorded by a mono-channel detector. RAFFT accelerates the alignment procedure efficiently by using fast Fourier transform (FFT) cross-correlation. However, RAFFT risks distorting peak shapes, because it does not consider peak information when moving segments: it inserts and deletes data points only at the start and the end of segments, which may introduce artifacts and remove peak points. Note that nonlinear retention time shifts often exist in the chromatograms of real samples. To solve the nonlinear shift problem, algorithms such as nonlinear alignment by moving window fast Fourier transform (MWFFT) cross-correlation (Li et al., 2013b) and multi-scale peak alignment (MSPA) (Zhang et al., 2012b) have been proposed. MSPA iteratively divides a chromatogram into small segments to handle nonlinear retention time shifts; FFT cross-correlation is then used to estimate candidate shifts and to align the peaks gradually, step by step. A simple example of the application of MSPA is demonstrated in Figure 4, in which the retention time shifts of GC-MS TICs in different samples are removed successfully. Other alternatives exist, such as kernel density estimation (Smith, Want, 2006), component-resolving algorithms (Andreev et al., 2003) and progressive clustering (De Souza et al., 2006). Another approach to alignment is to integrate the peak areas. Although this is time-consuming and meticulous work, it also serves as a "data cleaning" process, because retention time shifts, noise and background shifts are removed at the same time.
Insert Figure 4
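The shift estimation step shared by RAFFT, MWFFT and MSPA can be sketched for a single segment. The example below computes the cross-correlation directly over a small range of lags instead of using the FFT, which gives the same lag for short traces and keeps the code dependency-free. The function names and the max_shift parameter are illustrative assumptions, and only a single global (linear) shift is handled.

```python
def best_shift(reference, sample, max_shift=10):
    """Return the lag (in data points) that maximises the cross-correlation
    between the reference and the sample."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_shift, max_shift + 1):
        score = 0.0
        for i, ref_value in enumerate(reference):
            j = i + lag
            if 0 <= j < len(sample):
                score += ref_value * sample[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def apply_shift(sample, lag, fill=0.0):
    """Build an aligned trace whose point i takes the sample value at i + lag."""
    n = len(sample)
    return [sample[i + lag] if 0 <= i + lag < n else fill for i in range(n)]


# Reference TIC with an apex at index 10; the sample elutes 3 points later.
reference = [0.0] * 30
reference[9:12] = [1.0, 5.0, 1.0]
sample = [0.0] * 30
sample[12:15] = [1.0, 5.0, 1.0]

lag = best_shift(reference, sample)
aligned = apply_shift(sample, lag)
```

Segment-wise methods repeat this estimate on progressively smaller windows so that different regions of the chromatogram can receive different shifts.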
In the dimension reduction step, loss of information is inevitable. To handle this problem, another strategy is to model the high-dimensional data directly by multi-way analysis methods, so that the so-called two-dimensional advantage (e.g. the mass spectral information of metabolites) is retained. For example, the alignment method of Prakash et al. (Prakash et al., 2006) and the ChromAlign method (Sadygov et al., 2006) both use the raw high-way data. These algorithms first compute a matrix of similarity scores between pairs of spectra. Dynamic programming is then applied to find an optimal path through the matrix and to define the mapping of paired spectra. In the method proposed by Pierce et al., a piecewise single-dimension retention time alignment algorithm is applied to align two-dimensional data (Pierce et al., 2005). In the continuous profile model (CPM) method, the two-dimensional data are divided into four m/z bins, as opposed to aligning only a single TIC (Listgarten et al., 2007). In addition, some algorithms have been upgraded to align two-way retention time shifts more comprehensively, such as the algorithm using a novel indexing scheme (Pierce, Wood, 2005). This algorithm aligns the fingerprints in both dimensions at the same time, preserving the separation information in each. It is suitable for correcting shifts in different kinds of two-dimensional separations, such as GC×GC, LC×LC, LC×CE and LC×GC. Finally, gap filling is often used after peak alignment to fill in missing values when not all peaks can be detected in every sample. This procedure is necessary to avoid including too many zero values, which would have negative effects on the subsequent data modeling.
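The dynamic programming step described above, finding an optimal path through a pairwise score matrix, can be illustrated with a classical dynamic time warping sketch. As a simplifying assumption, each time point is represented by a single intensity and the pairwise score is the absolute intensity difference; the cited methods compare full mass spectra instead. The function name dtw_path is hypothetical.

```python
def dtw_path(a, b):
    """Return (total cost, warping path) between two 1-D traces.

    cost[i][j] holds the cheapest cumulative cost of aligning the first
    i points of a with the first j points of b.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            distance = abs(a[i - 1] - b[j - 1])
            cost[i][j] = distance + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# The second trace is the first one delayed by one time point.
first = [0.0, 1.0, 5.0, 1.0, 0.0]
second = [0.0, 0.0, 1.0, 5.0, 1.0, 0.0]
total_cost, mapping = dtw_path(first, second)
```

The returned mapping pairs the apex of the first trace (index 2) with the apex of the second (index 3), which is the correspondence an alignment method needs in order to correct the retention time shift.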
2.1.4 Normalization
Normalization aims to remove confounding variation from experimental sources, such as analytical noise and experimental bias, while retaining the relevant variation due to biological events (Castillo, Gopalacharyulu, 2011). If the signals of the majority of metabolites are stable, a simple and efficient normalization can be achieved by expressing the abundance of each analyte relative to all other peaks, as in unit norm and median intensity normalization (Wang, Zhou, 2003). However, the assumption of negligible changes in overall concentration is often hard to satisfy; the total concentration of analytes may change considerably owing to both laboratory system errors and differences in large-scale biological experiments. In this case, scaling based on the total chromatogram may seriously distort the data.
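The two simple sample-wise normalizations mentioned above can be sketched as follows. The example takes the unit norm as the sum of all peak intensities, which is one common choice and an assumption here; the function names are illustrative.

```python
from statistics import median


def normalize_total(sample):
    """Divide every peak intensity by the summed intensity of the sample."""
    total = sum(sample)
    return [value / total for value in sample]


def normalize_median(sample):
    """Divide every peak intensity by the median intensity of the sample."""
    mid = median(sample)
    return [value / mid for value in sample]


# The same extract measured at two different dilutions.
concentrated = [20.0, 40.0, 60.0, 80.0]
diluted = [10.0, 20.0, 30.0, 40.0]

total_a = normalize_total(concentrated)
total_b = normalize_total(diluted)
median_a = normalize_median(concentrated)
median_b = normalize_median(diluted)
```

Both normalizations make the two dilutions identical, which is the intended effect. The same operation would, however, also hide a genuine biological change in overall metabolite concentration, which is the limitation noted in the text.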
For analytical reasons, the measurement errors of minor metabolites with low concentrations are larger than those of primary metabolites with high abundance, because compounds at lower concentrations are more easily affected by analytical noise. To make different metabolites comparable, a scaling procedure is required to normalize the variances. Different scaling methods, such as autoscaling (1/SD) and Pareto scaling (1/sqrt(SD)), can be used. Autoscaling is the most popular scaling method in metabolomics; each variable is given equal (unit) variance by multiplying it by the inverse of its standard deviation (SD). Pareto scaling is softer than autoscaling: each variable is weighted by the inverse of sqrt(SD). It can increase the importance of low-abundance compounds without significantly amplifying the noise.
During data analysis, many researchers tend to assume that the total variation originating from sampling, analytical measurement and biological events has equal standard deviations and is distributed symmetrically around zero (van den Berg et al., 2006). However, this assumption does not hold in many cases. Biological effects on concentration can vary dramatically between metabolites. This metabolite-dependent variation is named heteroscedasticity, and it can be detrimental to observing a particular biological situation (van den Berg, Hoefsloot, 2006). A mathematical transformation is helpful for correcting skewed data before modeling. The log transformation (Kvalheim et al., 1994) and the power transformation (Sokal and Rohlf, 1995) are two well-known methods that have been applied to correct heteroscedasticity. When the relative standard deviation is constant, a log transformation removes heteroscedasticity perfectly (Kvalheim, Brakstad, 1994). However, the log transformation has a drawback: the transformed values approach minus infinity as the original values approach zero. The power transformation does not have this near-zero artifact and gives results similar to those of the log transformation.
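The scaling and transformation steps can be illustrated for a single variable (one metabolite measured across samples). The sketch assumes that the variable is mean-centred before scaling, which is usual practice but is not stated in the text, and it uses the square root as the power transformation. The function names and the offset parameter are illustrative.

```python
import math
from statistics import mean, stdev


def scale_variable(values, method="auto"):
    """Mean-centre one variable and divide by SD ("auto") or sqrt(SD) ("pareto")."""
    centre = mean(values)
    sd = stdev(values)
    divisor = sd if method == "auto" else math.sqrt(sd)
    return [(value - centre) / divisor for value in values]


def log_transform(values, offset=0.0):
    """Base-10 log transformation; fails for values <= -offset."""
    return [math.log10(value + offset) for value in values]


def power_transform(values):
    """Square-root transformation, which is defined at zero."""
    return [math.sqrt(value) for value in values]


abundant = [100.0, 200.0, 300.0]
autoscaled = scale_variable(abundant, method="auto")
pareto = scale_variable(abundant, method="pareto")
logged = log_transform([1.0, 10.0, 100.0])
rooted = power_transform([0.0, 4.0, 9.0])
```

After autoscaling the variable has unit standard deviation, whereas Pareto scaling leaves it with a standard deviation of sqrt(SD), so abundant metabolites keep part of their original weight. The square-root transformation accepts zero intensities, for which the logarithm is undefined.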
Another sophisticated strategy for normalization is to use internal standards (ISs), e.g. isotopically labeled internal standards, together with quality control (QC) samples in each data acquisition procedure (Gika et al., 2008b). Comprehensive and representative IS-based normalization rests on the key assumption that the variance exhibited by the ISs comes solely from systematic error. Unfortunately, this is not always the case. Both insufficient chromatographic separation and ion suppression can cause a concentration change in one component to appear as variance in the measurement of a different one. If such confounding between analytes and ISs occurs, direct normalization using the ISs may suppress the signal and lead to loss of information. In principle, a representative IS used for normalization should be similar to the analyte, and the systematic error should affect both indiscriminately. The IS is often selected from specific regions of retention time (RT) or m/z. However, the selected RT or m/z regions cannot represent all matrix or chemical properties, which obscures data variation (Castillo, Gopalacharyulu, 2011). No single IS can estimate the systematic error of a complex biological matrix; therefore, multiple ISs work better in these cases. Furthermore, the use of ISs must minimize the risk of cross-contribution (CC). If the masses used to quantify the IS are carefully selected, this problem can be solved easily (Liu et al., 2002). However, this is nontrivial in metabolomics research because biological samples are too complex, and it is difficult to predict which ions will cross-interfere. CC effects can cause serious loss of information, especially when the interfering analytes are related to the factors of interest in metabolomics data sets. Redestig et al. presented an effective normalization algorithm that compensates for systematic CC effects and improves the normalization of mass spectrometry based metabolomics data (Redestig et al., 2009).
On the other hand, in order to picture the global variability of a measurement system, it is recommended to inspect QC samples before normalization by visualizing the data with PCA. Generally, a QC sample is a pool of several individual samples having similar characteristics. The study samples are compared with the QCs to evaluate the variability. In multivariate statistical analysis such as PCA, the QC samples should cluster closely on the scores plot, which indicates that the analytical system has good reproducibility (Gika et al., 2008a).
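The QC check described above can be sketched numerically as follows. This is only an illustrative sketch: the spread ratio, the function names and the synthetic data are assumptions introduced here, not a procedure prescribed by the cited authors.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Scores of the mean-centered data on the first principal components."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

def qc_spread_ratio(X, is_qc, n_components=2):
    """Mean distance of QC scores from their centroid, divided by the same
    quantity for the study samples. Small values mean tightly clustered QCs."""
    scores = pca_scores(X, n_components)

    def spread(s):
        return np.linalg.norm(s - s.mean(axis=0), axis=1).mean()

    return spread(scores[is_qc]) / spread(scores[~is_qc])

# Synthetic example: 20 study samples and 6 pooled QC injections, 50 variables.
rng = np.random.default_rng(1)
study = rng.normal(0.0, 1.0, size=(20, 50))
qc = study.mean(axis=0) + rng.normal(0.0, 0.05, size=(6, 50))  # pool + small analytical noise
X = np.vstack([study, qc])
is_qc = np.array([False] * 20 + [True] * 6)

ratio = qc_spread_ratio(X, is_qc)
print(f"QC spread relative to study samples: {ratio:.3f}")
```

A ratio close to zero corresponds to QC points that nearly coincide on the scores plot; what threshold counts as acceptable is a study-specific decision and is not fixed by this sketch.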
2.2 Identification of metabolites
Confidently identifying metabolites from MS spectral data is generally recognized as a significant challenge in the plant metabolomics community, especially in untargeted analysis. Although investigation of this topic was initiated much earlier than that of protein and DNA sequencing, metabolite identification has only reached a high-throughput and automated level in recent years. The main factor behind this delayed breakthrough is the biochemical diversity of metabolites. However, benefitting from advanced computational techniques and methods, advanced mass spectrometry instrumentation, a wealth of knowledge on ion fragmentation, well established databases and libraries, and especially the fruitful work of the past decade, metabolite identification can now cover unknowns with reasonable accuracy and can be performed in a high-throughput manner. A variety of overviews have been published on this topic, and comprehensive summaries of different identification strategies can be found in (Kind and Fiehn, 2010, Scheubert et al., 2013, Wishart, 2009), whereas instructions for practical use can be obtained from (Neumann et al., 2013, Watson, 2013) and a useful guide for beginners in mass spectrometry is given in (Holcapek et al., 2010). Thus, in this section we briefly cover the currently available algorithms and tools that are valuable for metabolite identification using MS.
2.2.1 Standards for reporting metabolite identification results
Since the outcome of metabolomics research strongly depends on biological conditions and experimental measurement, formulating a standard for reporting the outcome is of great importance for adopting common semantics in the literature, evaluating current work (both for readers and for journal peer reviewers), exchanging data and storing data in public repositories (Field and Sansone, 2006). The most frequently referenced standard for metabolite identification is the one proposed by the Chemical Analysis Working Group (CAWG), a part of the Metabolomics Standards Initiative (MSI) (Sumner et al., 2007), which classifies identifications into four levels:
LEVEL 1. Unambiguously identified compounds: require comparison with an authentic chemical standard on two or more orthogonal properties analyzed under identical analytical conditions;
LEVEL 2. Putatively identified compounds: based on comparison of physicochemical properties and/or spectral similarity with public or commercial spectral libraries, without an authentic chemical standard;
LEVEL 3. Putative identification of compound classes: based on comparison of physicochemical properties of a chemical class of compounds, or spectral similarity to known compounds of a chemical class;
LEVEL 4. Unknown compounds: unidentified and unclassified, but can still be differentiated and quantified by spectral or chromatographic data.
This standard, along with, for example, guidelines for hyphenated MS experiments and data preprocessing, provides a sound basis for metabolomics studies. Its drawback is also apparent: it cannot be characterized quantitatively, for example using probability, and is thus still arbitrary (Creek et al., 2014). For instance, in a high-throughput analysis it is impossible to collect all authentic metabolites to achieve LEVEL 1 identification; it is more practical to achieve LEVEL 2 using accurate mass with the assistance of isotopic distribution, or reference library search assisted by retention index. However, the latter can also yield unambiguous identifications under appropriate criteria such as a false discovery rate of 5% (Kim and Zhang, 2014). A recent suggestion is to employ confidence levels to solve this problem (Schymanski et al., 2014), but much more effort is still required. In addition to this standard, there are other guidelines that may serve as references, for example EU Commission Decision 2002/657/EC (http://ec.europa.eu/food/food/chemicalsafety/residues/lab_analysis_en.htm), which uses an identification point system to score the identification results.
2.2.2 Metabolite identification using GC-MS
GC-MS has been routinely used in metabolomics and is considered a gold standard technique for high-throughput metabolomics studies (Fiehn and Spranger, 2003, Lisec et al., 2006). The main advantages of this technique are its robustness, high reproducibility in both the chromatographic and mass spectrometric dimensions, high sensitivity, and the existence of mature protocols for sample preparation and data processing. It is one of the oldest techniques in analytical science. Moreover, although GC-MS analysis requires the analytes to be volatile and thermally stable, the range can be considerably extended by chemical derivatization (Villas-Boas et al., 2005). Therefore, great effort has been made to interpret MS spectra from the electron impact (EI) ion source. Currently, the most frequently adopted and reliable method is library search. In this method, each experimental MS spectrum is compared with reference MS spectra in a mass spectral library, and similarity scores are calculated. The library compound with the highest similarity score is theoretically considered to be the one that generated the experimental spectrum. The commonly adopted mass spectral libraries are listed in Table 1. The main factors that influence the search results include the quality of the experimental MS spectra, the size of the mass spectral library and the similarity score calculation algorithm (Koo et al., 2013). From the algorithmic point of view, the method for calculating the similarity score is the most important factor, because the quality of MS spectra depends significantly on the experiment, and the libraries are generally commercial products that cannot be freely configured by users and are still relatively small. A previous investigation showed that the most robust similarity score was the dot product computed on square-root-scaled mass spectral intensities (Stein and Scott, 1994). However, several other algorithms, such as cross-correlation (Eng et al., 2008, Powell and Hieftje, 1978), and probability based algorithms, such as the probability based matching (PBM) system (McLafferty et al., 1974) and X-Rank (Mylonas et al., 2009), are also powerful. In addition, each MS workstation of a GC-MS instrument has its own algorithm to calculate the similarity between experimental and library MS spectra, and MassLib adopts the SISCOM system (Damen et al., 1978) to perform the library search.
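A library search with a dot-product score on square-root-scaled intensities can be sketched as below. This is a simplified illustration: it uses a plain cosine similarity on nominal-mass bins and omits any further weighting that published scoring schemes may apply. The function names and the toy spectra are hypothetical and do not come from any real library.

```python
import numpy as np

def to_vector(peaks, max_mz=500):
    """Bin a {nominal m/z: intensity} spectrum into a fixed-length vector."""
    v = np.zeros(max_mz + 1)
    for mz, intensity in peaks.items():
        v[int(mz)] = intensity
    return v

def sqrt_dot_product(query, reference, max_mz=500):
    """Cosine similarity after square-root scaling of the intensities."""
    q = np.sqrt(to_vector(query, max_mz))
    r = np.sqrt(to_vector(reference, max_mz))
    denom = np.linalg.norm(q) * np.linalg.norm(r)
    return float(q @ r / denom) if denom > 0 else 0.0

def library_search(query, library):
    """Return (name, score) pairs sorted from best to worst match (the hit list)."""
    hits = [(name, sqrt_dot_product(query, ref)) for name, ref in library.items()]
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Hypothetical toy spectra for illustration only.
library = {
    "compound_A": {73: 100, 147: 40, 217: 25},
    "compound_B": {73: 100, 129: 60, 204: 80},
}
query = {73: 95, 147: 45, 217: 20}

hits = library_search(query, library)
for name, score in hits:
    print(f"{name}: {score:.3f}")
```

The square-root scaling reduces the dominance of a few very intense ions, so that weaker but structurally informative ions contribute more to the score.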
Insert Table 1
Due to the complexity of metabolites and their EI-MS spectra, for example the existence of isomers and of mass spectra generated by co-eluted components, a target compound does not always gain the highest similarity score. More often, it appears near the top of the hit list (e.g. at the second or third rank, or further down). Thus, careful manual checking is always required. Consequently, taking other information into consideration, such as the retention index (RI, e.g. Kovats retention index) of the target compound, is very helpful (Dunn et al., 2011, Kopka, 2006). RI is a structurally and physicochemically specific indicator and can effectively differentiate compounds having similar mass spectra. In fact, this indicator together with the EI-MS spectrum makes up the mass spectral tag (MST), which is widely accepted in plant metabolomics and is the organizing unit of the Golm Metabolome Database (GMD) (Kopka et al., 2005, Schauer et al., 2005, Wagner et al., 2003) and BinBase/FiehnLib (Kind et al., 2009). The NIST standard reference database also includes a large number of RI values. Another improvement, especially for the case of co-elution, can be achieved by mass spectral deconvolution or two-dimensional data resolution methods, which can resolve overlapped peaks generated by co-eluted components into the pure chromatogram and mass spectrum of each component (see section 2.1.2).
Though mass spectral library search provides promising confidence for metabolite identification, the greatest drawback of this strategy is that currently released libraries are still far from covering all plant metabolites, leaving a large number of metabolites that are not in the libraries unidentifiable. This problem is faced by all scientific fields involving compound identification. Attempts to overcome this disadvantage triggered the development of one of the oldest artificial intelligence systems, the DENDRAL Project (Lindsay et al., 1993), initiated in 1965 to study the relationships between mass spectra and compound structure. Unfortunately, this project ultimately did not succeed. However, based on the pioneering work of the DENDRAL Project and other early mass spectrum interpretation systems, numerous methods suitable for compound identification independent of mass spectral libraries were developed. These methods can be divided into two series. The first series learns structural features of compounds from their experimental mass spectra and then deduces an unknown structure from the features of a given spectrum according to the previously constructed learning models. There are two ways to achieve this. The first is to exhaustively generate all possible isomers for the molecular mass extracted from the input MS spectrum using a structure generation module (e.g. MOLGEN (Benecke et al., 1995) and OMG (Peironcely et al., 2012)) and to retain the structures that best explain the spectrum according to fragmentation rules. Generally, machine learning algorithms are adopted in this procedure to identify whether a substructure is present in the unknown compound. This can filter out a large number of isomers without the identified
substructures (Schymanski et al., 2008). MOLGEN-MS (Kerber et al., 2001) and MassLib have been developed in this manner. The web based algorithm embedded in GMD, however, employs decision trees to predict the 166 commonest functional groups in plant metabolites after training on known metabolites in GMD using the corresponding mass spectral data and retention indices (Hummel et al., 2010), providing invaluable information for inferring the structures of unknown metabolites. The other way is based on library search under the assumption that similar structures have similar spectra. Possible substructures of an unknown compound can then be deduced from the library compounds with the top similarity scores (Stein, 1995). The second series of methods predicts the mass spectrum of an input molecule directly. Based on the wealth of knowledge of ion fragmentation and aided by advanced computational technologies, accurately predicting mass spectra has become feasible. Mass Frontier (Thermo Scientific), one of the most commonly adopted software packages for structure elucidation, uses the HighChem Fragmentation Library, which stores about 31,000 fragmentation mechanisms, to predict and interpret experimental mass spectra. The commercial software ACD/MS Fragmenter (ACD/Labs) is also very powerful in MS spectrum prediction and has gained popularity in the metabolomics community. Similarly, the freely available tool Mass Spectrum Interpreter released by NIST uses the thermochemical kinetics of general fragmentation reactions, summarized from known fragmentation rules, to predict mass spectra. A common difficulty among these powerful methods is that they cannot effectively distinguish the correct structure from its isomers, as has been pointed out after comparing different tools (Schymanski et al., 2009). However, improvements can be made by combining different tools (Schymanski et al., 2012). In addition to the above methods, unknown compounds can also be putatively identified by combining MS spectral characteristics with other information. For
example, the combination of the accurate mass to charge ratio (m/z) of the molecular ion provided by chemical ionization, in silico predicted retention index and fragmentation pattern can effectively constrain the number of candidate compounds in the hit list (even to a single one) without requiring any mass spectral library (Fiehn et al., 2000, Kumari et al., 2011). A practical guide to small molecule structure elucidation using several strategies that differ from the above computational methods and do not require mass spectral libraries can be found in (Zhang et al., 2013b).
2.2.3 Metabolite identification using LC-MS
As mentioned in the previous section, GC-MS analysis is standardized by, e.g., fixing the ionization method (i.e. EI) at a fixed energy (i.e. 70 eV), which ensures that the mass spectra generated are robust and highly reproducible among instruments and laboratories. As a consequence, the reference mass spectral libraries are standardized and well quality controlled, and the mechanisms of fragmentation during EI are now extensively known, making the identification of compounds highly practicable and the quality of the results assessable. The biggest limitation of GC-MS, however, is the requirement for volatile and thermally stable analytes, or for an additional derivatization step to render some polar and non-volatile species amenable to analysis (Villas-Boas et al., 2005). This dramatically reduces the range of analyzable species, since many species, such as secondary metabolites, are non-volatile or have high molecular weights. Further, derivatization complicates the interpretation of mass spectra. In contrast, LC-MS does not require the species to be volatile and can be used to analyze compounds with heat-labile functional groups, chemically unstable substructures, high molecular weights and so forth; it can therefore analyze a much wider range of plant metabolites than GC-MS. Moreover, sample preparation for LC-MS is simpler (Kim and Verpoorte, 2010, Wu et al., 2012). These great advantages of LC-MS, along with advances in instrumentation, for example the development of the electrospray ionization (ESI) technique (Fenn et al., 1989), which is well compatible with tandem mass spectrometry, and the steady improvement of resolving power, make LC-MS the method of choice in 'omics' research, especially in high-throughput analysis.
While LC-MS has established itself as an indispensable technology, identifying metabolites from its MS spectra is not straightforward due to the variation of experimental settings, for example chromatographic conditions and mass spectrometry parameters (Halket et al., 2005). This becomes even more serious when discovering unknowns in a large and complex metabolite space, as in untargeted metabolomics analysis. Additionally, the fragmentation mechanisms during ionization on LC-MS platforms under various activation energies are still unclear, and their investigation is far behind that of EI. These factors leave the confident interpretation of MS spectra derived from different LC-MS and LC-MS/MS platforms a significant challenge. Fortunately, recent active studies have made remarkable advances in metabolite identification, and several tools and various databases are publicly available (see Table 1 and Table 2). In general, currently available tools are developed based on two aspects of LC-MS data: accurate mass together with other information such as isotopic distribution, and MS/MS spectra.
Insert Table 2
2.2.3.1 Structure inference by accurate mass and other information
The ability to measure m/z accurately is one of the most important features of high resolution mass spectrometry and has greatly facilitated the whole MS data analysis workflow. For metabolite identification, using the accurate mass calculated from the measured m/z is generally the first step (Holcapek et al., 2010), as it is the simplest and most straightforward. Formula generation, large compound database search or metabolic network search can be adopted here. For formula generation, all combinations of predefined elements are enumerated exhaustively under constraints on element counts and mass range. A number of commercial or freely available tools have been developed to assist with this (see Table 2). As expected, a very large number of candidate formulas will be generated, especially for relatively large molecular masses. This makes it impracticable to assign a single formula to each m/z based solely on accurate mass. It is therefore important to define rules to filter out the false positives.
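The exhaustive formula generation described above can be sketched for a small element set. This is a brute-force illustration restricted to C, H, N and O in a neutral molecule; the element limits, the tolerance and the function name are assumptions chosen for the example, and no chemical plausibility checks are applied.

```python
from itertools import product

# Monoisotopic masses in Da
MASS = {"C": 12.0, "H": 1.0078250319, "N": 14.0030740052, "O": 15.9949146221}

def generate_formulas(target_mass, ppm=5.0, max_counts=None):
    """Enumerate CHNO compositions whose monoisotopic mass lies within
    `ppm` of `target_mass`."""
    max_counts = max_counts or {"C": 20, "H": 40, "N": 5, "O": 10}
    tol = target_mass * ppm * 1e-6
    hits = []
    for c, n, o in product(range(max_counts["C"] + 1),
                           range(max_counts["N"] + 1),
                           range(max_counts["O"] + 1)):
        heavy = c * MASS["C"] + n * MASS["N"] + o * MASS["O"]
        # Hydrogen count that best closes the remaining mass gap
        h = round((target_mass - heavy) / MASS["H"])
        if 0 <= h <= max_counts["H"]:
            mass = heavy + h * MASS["H"]
            if abs(mass - target_mass) <= tol:
                hits.append(({"C": c, "H": h, "N": n, "O": o}, mass))
    # Closest mass first
    return sorted(hits, key=lambda hit: abs(hit[1] - target_mass))

# Example query near the monoisotopic mass of C6H12O6
candidates = generate_formulas(180.0634, ppm=5.0)
for formula, mass in candidates:
    print(formula, round(mass, 5))
```

Even this restricted search can return several compositions for one mass, and the list grows quickly when more elements or a wider tolerance are allowed, which is why additional filtering rules are needed.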
Among all the developed rules, similarity checking of the isotopic distribution is commonly accepted as the most critical criterion, and it has been demonstrated that most spurious formulas can be rejected by this check (Erve et al., 2009, Kind and Fiehn, 2007). Isotopes of an element are naturally stable variants of the element that differ in the number of neutrons as well as in natural abundance (expressed as the percentage of each isotope, e.g. the natural abundances of 12C and 13C are 98.93% and 1.07%, respectively). Different elements have distinct isotopic abundance distributions in nature. Therefore, theoretically, each elemental composition or formula has a unique isotopic distribution. This is the basic principle of compound identification using isotopic distribution. Namely, by comparing the instrument-determined isotopic distribution with the simulated one, the formula candidates can be ranked, with the top ones being the most similar, via so-called spectral comparison (Wang and Cu, 2010), or rejected if the relative isotopic abundances (RIA) of the two distributions are unacceptably different. Efforts to simulate isotopic distributions precisely have been under way for decades, and several tools are now freely available (Valkenborg et al., 2012). If the resolution of an MS instrument is high enough, formulas can be exclusively identified using the RIA of a single element. For example, the number of carbon atoms could be accurately estimated by comparing the isotope peak generated by 13C only with the mono-isotopic peak, because this peak was well resolved from the other [M+1]+ peaks in FT-ICR-MS owing to the fact that the mass difference between the 13C- and 15N-substituted peaks is 0.00632 Da (Miura et al., 2010). Similarly, the number of N atoms or the presence of Cl, Br and S can be deduced from the fine structure of the isotopic distribution (Kaufmann, 2010). This strategy has now been extended and confirmed using higher resolution instruments for high-throughput metabolomics analysis (Nagao et al., 2014). However, recent investigations have shown that the accuracy of RIA measurements is highly dependent on the type and resolution of the MS instrument, peak intensity, accurate mass and data handling methods, and that high RIA measurement errors appear in peaks with low signal to noise ratio (S/N), low m/z or co-eluting species (Knolhoff et al., 2014, Koch et al., 2007, Weber et al., 2011, Xu et al., 2010). These errors can seriously mislead the identification results (Koch et al., 2007). Unfortunately, no systematic evaluation of the influence of RIA measurement error on formula inference has been performed. Although one suggestion for reducing this influence is to set a larger error tolerance during comparison (Weber et al., 2011), caution is still needed when using RIA to identify metabolites, and additional information is required.
The exhaustive enumeration of elemental compositions according to the input parameters, for example element types and accurate mass with allowed errors, tends to generate meaningless formulas that are unlikely to appear in plant metabolites. Many of these formulas cannot be rejected by RIA criteria alone. Hence it is necessary to check the formulas using other rules. The well-known "Seven Golden Rules" were defined after statistically analyzing formulas extracted from the Wiley and NIST02 mass spectral databases and the Dictionary of Natural Products (Kind and Fiehn, 2007) and have been demonstrated to be an efficient tool in metabolomics studies. An updated version of these rules was defined recently after analyzing large numbers of formulas in the PubChem database (Lommen, 2014).
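As a rough illustration of the single-element RIA idea described in this section, the carbon count can be estimated from the intensity of a resolved 13C isotope peak relative to the mono-isotopic peak. The sketch below assumes that the 13C peak is fully resolved from other [M+1]+ contributions and that intensities are free of measurement error; the function names and the example intensities are hypothetical.

```python
# Natural abundances of the carbon isotopes as quoted in the text
ABUNDANCE_12C = 0.9893
ABUNDANCE_13C = 0.0107

def expected_ratio(n_carbons):
    """Theoretical I(13C peak) / I(mono-isotopic peak) for n carbon atoms.

    With a binomial isotope model the ratio is n * p / (1 - p),
    where p is the natural abundance of 13C.
    """
    return n_carbons * ABUNDANCE_13C / ABUNDANCE_12C

def estimate_carbon_count(mono_intensity, c13_intensity):
    """Invert the relation above and round to the nearest integer."""
    per_carbon = ABUNDANCE_13C / ABUNDANCE_12C
    return round((c13_intensity / mono_intensity) / per_carbon)

# Hypothetical intensities: the 13C peak is 6.5% of the mono-isotopic peak
n = estimate_carbon_count(1.0e6, 6.5e4)
print(n)
```

Because each carbon contributes only about 1.08% to the ratio, a relative intensity error of a few percent already shifts the estimate by one carbon for larger molecules, which is consistent with the caution about RIA measurement error expressed above.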
Once formulas are determined or ranked, they are subsequently mapped to known metabolites, typically by searching large databases (Little et al., 2012, Zhu et al., 2013). The databases frequently adopted in metabolomics are listed in Table 1. Candidate compounds can be further constrained by prior biological knowledge, for example lists of expected metabolites of the analyzed organism. Since metabolites in a biological sample are biochemically connected (e.g. by chemical transformation) rather than randomly mixed (Breitling et al., 2006), it is beneficial to map the metabolite candidates onto metabolic networks to gain confident identification (Gipson et al., 2008, Rogers et al., 2009, Weber and Viant, 2010). For example, MI-Pack maps mass spectral peaks onto the KEGG network database (Ogata et al., 1999) and uses a rigidly defined mass error surface of mass differences between substrate-product pairs derived from the database for metabolite identification (Weber and Viant, 2010). A significant reduction of both false negatives and false positives is consequently obtained. This approach is advantageous not only for metabolite identification but also for mining related subnetworks that represent the activity or functions of the metabolites, as has been demonstrated in recent work (Doerfler et al., 2014, Li et al., 2013a).
LC-MS can detect ion series (so-called satellite ions) of a metabolite generated by fragmentation reactions during ionization, including neutral losses and ions with different adducts (Brown et al., 2009, Huang et al., 1999). Other types of fragments, for example artifact ions, background ions, multiply charged fragments and so on, can
also be generated by LC-MS (Keller et al., 2008). These fragments can give rise to a large number of false positives during metabolite identification using accurate mass. Several free tools, including PUTMEDID-LCMS (Brown et al., 2011), CAMERA (Kuhl et al., 2012), IDEOM (Creek et al., 2012), the MZedDB tool (Draper et al., 2009) and MAIT (Fernandez-Albert et al., 2014), can be applied to identify these fragments during data preprocessing. In addition to fragment interferences, another factor that can potentially disturb metabolite identification is the extraction of the accurate mass, since the algorithm that calculates it from the profile peak is hidden in the commercial workstation. Also, LC-MS instruments have systematic mass deviations during the conversion of ion signals to the mass spectrum representation (Savitski et al., 2004) or under different experimental settings (Petyuk et al., 2008). In proteomics, the improvement of mass accuracy has been well studied via calibration using peptide MS/MS spectra (Egertson et al., 2012, Venable et al., 2006), correctly identified peptides (Petyuk et al., 2010) and background ions (Haas et al., 2006). The calibrated accurate mass provides superior discrimination between false and true identifications (Haas et al., 2006). The situation is more complex in metabolomics, however, because more fragments are produced alongside a metabolite precursor ion and the fragmentation of metabolites is more complicated than that of peptides. In a recent approach, mass accuracy was improved via background ions (Scheltema et al., 2008). A great advance in mass accuracy has also been achieved in the commercial software MassWorks™ (Cerno Bioscience) through peak shape calibration with the aid of internal standards, which gives unit resolution instruments (e.g. ion trap instruments) the capability of compound identification using accurate mass and isotopic distribution in the manner of high resolution instruments (Kuehl and Wang, 2006). A correction of the automatic gain control system calibrated by multiple linear regression can also achieve mass accuracy at the ppb (parts per billion) level (Williams and Muddiman, 2007). Nevertheless, more studies on this topic are still required for identifying metabolites, especially in untargeted analysis.
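The adduct-type satellite ions mentioned earlier in this section can be recognized, in the simplest case, by testing whether a peak matches the neutral mass of a candidate plus a known adduct shift within a ppm tolerance. The sketch below is only an illustration of that idea; the adduct shifts are commonly tabulated values for singly charged positive-mode ions and should be verified before use, and the peak list is hypothetical. The tools cited above use more elaborate criteria than this.

```python
# Mass shifts (Da) of common singly charged positive-mode adducts
# (commonly tabulated values, stated here as an assumption)
ADDUCT_SHIFT = {
    "[M+H]+": 1.007276,
    "[M+NH4]+": 18.033823,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
}

def ppm_error(measured_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6

def annotate_adducts(peaks_mz, neutral_mass, tol_ppm=5.0):
    """Label peaks that match an adduct of `neutral_mass` within `tol_ppm`."""
    labels = {}
    for mz in peaks_mz:
        for adduct, shift in ADDUCT_SHIFT.items():
            if abs(ppm_error(mz, neutral_mass + shift)) <= tol_ppm:
                labels[mz] = adduct
    return labels

# Hypothetical co-eluting peaks; 180.06339 Da is the neutral mass of C6H12O6
peaks = [181.07066, 203.05261, 219.02655, 250.12345]
labels = annotate_adducts(peaks, 180.06339)
for mz, adduct in labels.items():
    print(mz, adduct)
```

Peaks annotated as adducts of the same neutral mass can then be collapsed into one feature before formula or database search, which reduces the number of spurious candidates.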
2.2.3.2 Metabolite identification by MS/MS
MS/MS is one of the best techniques for structure elucidation and has been widely applied in analytical fields. As an indispensable part of an LC-MS system, ionized molecules selected by the instrument are dissociated into charged or neutral pieces by activation methods such as collision induced dissociation (CID). Recording all the charged fragments and intact molecule ions forms the MS/MS spectrum. This MS/MS spectrum generation procedure indicates that the structure of a molecule can, in principle, be deduced from its MS/MS spectrum and, moreover, that the strategies for interpreting GC-MS spectra, i.e. library search and de novo analysis, can be applied in this deduction. Therefore, several MS/MS spectral libraries as well as computational methods for spectral prediction or structure elucidation have been developed (see Table 1 and Table 2). Since the experimental conditions (e.g. collision energy) in MS/MS analysis are not as standardized as in GC-MS analysis, the currently constructed libraries are much smaller than whole metabolism or structure databases, and other factors also play a role (Stein, 2012, Werner et al., 2008), metabolite identification via spectral library search has not gained the popularity it has in GC-MS analysis. As a consequence, many more studies focus on developing computational methods to interpret MS/MS spectra without querying a spectral library.
Algorithms employed in currently developed software for computational MS/MS can be categorized into three basic approaches, namely mass spectrum prediction, in silico fragmentation and de novo elucidation (Hufsky et al., 2014). Mass spectrum prediction has been well studied in the interpretation of EI spectra and is the basic and one of the most important modules in peptide identification in hypothesis-driven proteomics. Due to the enormous diversity of small compounds, accurate MS/MS spectrum prediction is still a tough challenge. To predict the MS/MS spectrum of a given structure, Mass Frontier extracts from its fragmentation reaction library the reactions that may occur during fragmentation of this structure and uses them as rules to predict the fragments and intensities. ACD/MS Fragmenter handles spectrum prediction in a similar way, while MetISIS uses a machine learning algorithm to learn CID kinetics from experimental lipid MS/MS spectra in order to predict spectra for lipids in silico (Kangas et al., 2012).
Instead of directly predicting a mass spectrum, in silico fragmentation attempts to find, among all candidates, the structure that best explains the given MS/MS spectrum. This approach was first employed in EPIC, which used a bond disconnection algorithm to enumerate all possible substructures of a molecule and compared the substructures with formulas inferred from fragment ions. Relevant structures were then listed for user confirmation (Hill and Mortishire-Smith, 2005). Later, FiD (Heinonen et al., 2008) enumerated all substructures from the molecular graph using a depth-first graph traversal algorithm and matched them to fragment ions. All candidates were ranked according to the total bond dissociation energy (BDE) calculated from the bond cleavages in each candidate molecule, with the first rank having the lowest BDE. A similar procedure was implemented in Mass-MetaSite (Bonn et al., 2010). MetFrag used a more complex procedure than the above algorithms to extract substructure-fragment pairs, with additional consideration of rearrangement reactions during molecule fragmentation, and scored each candidate using both matched fragments and BDE (Wolf et al., 2010). This algorithm was recently integrated into MetFusion (Gerlich and Neumann, 2013) as a complement to library search and vice versa. An alternative procedure was implemented in FingerID, which calculates the likelihood between metabolites in a database and a given experimental MS/MS spectrum in a feature space called fingerprints, using an SVM model obtained by training on fingerprints extracted from the MassBank MS/MS spectral library (Heinonen et al., 2012). CFM, in turn, calculates the likelihood between database metabolites and given MS/MS spectra according to a competitive fragmentation process learned from a spectral library by the expectation maximization algorithm (Allen et al., 2014a).
De novo analysis, in contrast, infers the structure from the observed fragments in a given MS/MS spectrum. This approach first determines the formulas of fragments according to their high resolution m/z and then deduces the structure of the precursor ion using these formulas and the known fragmentation pathways that generate them. The most appropriate method employed for this deduction to date seems to be the fragmentation tree, with nodes being fragment formulas, edges being neutral losses and the root being the precursor (Bocker and Rasche, 2008, Rasche et al., 2011). Therefore, with an appropriate scoring scheme, an experimental MS/MS spectrum can be identified by extracting the optimal fragmentation tree defined by the scores. However, this procedure was later demonstrated to be extremely computationally intensive, even when the formula of the precursor has been determined (Hufsky et al., 2012). This obstacle can be partly overcome by heuristic methods (Rauf et al., 2012). Developing much more efficient methods for high-throughput analysis remains a challenge.
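The candidate-ranking step shared by the in silico fragmentation tools described above can be sketched in a strongly simplified form: each candidate structure is assumed to come with a list of predicted fragment m/z values, and candidates are ranked by the fraction of observed intensity that these fragments explain. The scoring function, the tolerance and all numbers are assumptions made for illustration; none of the cited tools is reproduced here, and BDE-based or probabilistic scores are not modeled.

```python
def explained_intensity(observed, predicted_mz, tol=0.01):
    """Fraction of observed intensity explained by predicted fragment m/z.

    observed: list of (mz, intensity) pairs from the MS/MS spectrum.
    predicted_mz: fragment m/z values predicted for one candidate structure.
    """
    total = sum(intensity for _, intensity in observed)
    matched = sum(intensity for mz, intensity in observed
                  if any(abs(mz - p) <= tol for p in predicted_mz))
    return matched / total if total else 0.0

def rank_candidates(observed, candidates, tol=0.01):
    """Return (name, score) pairs sorted from best to worst explanation."""
    scored = [(name, explained_intensity(observed, frags, tol))
              for name, frags in candidates.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)

# Hypothetical spectrum and hypothetical predicted fragments
observed = [(85.028, 30.0), (127.039, 100.0), (145.050, 55.0)]
candidates = {
    "candidate_1": [85.029, 127.039, 145.049],
    "candidate_2": [85.029, 99.044],
}

ranking = rank_candidates(observed, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In practice the predicted fragments would come from a fragmentation engine, and the score would usually combine the matched peaks with further terms such as bond dissociation energies.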
2.3 Variable selection
Biomarker screening (variable selection) plays an essential role in metabolomics because biomarker identification aims to convert metabolomic results into valuable biological knowledge. Variable selection has been developed for decades and is active in various research fields, such as statistical pattern recognition (Mitra et al., 2002), machine learning (Robnik-Šikonja and Kononenko, 2003), data mining (Liu and Motoda, 1998) and statistics (Miller, 2002). Moreover, it has proven effective in both theory and practice in improving learning efficiency, enhancing predictive accuracy and easing the explanation of learned results (Yu and Liu, 2004, Yun et al., 2013). Nowadays, high-throughput chemical data generated by modern analytical platforms such as GC-MS, LC-MS and NMR usually contain a large number of data points (variables) while the number of samples is relatively small, the so-called "large p, small n problem" in statistical learning (Boccard et al., 2010). Variable selection is in fact an optimization problem: finding an optimal combination of variables from a very large pool of variables. Solving this NP-hard problem is a great challenge. So far, chemometricians and statisticians have proposed a great many variable selection methods specific to this problem. Some are based on statistical features of variables, such as uninformative variable elimination (UVE) (Centner et al., 1996a), Monte Carlo based UVE (MC-UVE) (Cai et al., 2008), competitive adaptive reweighted sampling (CARS) (Li et al., 2009a, Zheng et al., 2012), iterative predictor weighting (IPW) (Forina et al., 1999), successive projection algorithm (Araújo et al., 2001) and Bayesian linear regression (BLR) (Chen and Martin, 2009). Others are based on optimization algorithms, such as rough set (Swiniarski and Skowron, 2003), particle swarm optimization (PSO) (Wang, Yi, 2008), stepwise selection (H Martens, 1989), forward selection (Blanchet et al., 2008, H Martens, 1989), backward elimination (H Martens, 1989, Sutter and Kalivas, 1993), genetic algorithm (GA) (Leardi, 2000, 2001, Yang and Honavar, 1998, Yun et al., 2014a) and simulated annealing (SA) (Kalivas et al., 1989). Here we divide them into two categories: variable ranking and variable subset selection (Narsky and Porter, 2013).
2.3.1 Variable ranking
Variable ranking is mostly used to reveal informative metabolites or biomarkers. Ranking assigns a measure of importance to each variable based on a certain criterion. This measure is usually a nonnegative value indicating the importance of a variable. PLS is a basic tool of chemometrics, and many PLS-based criteria are frequently employed to assess and rank the importance of variables, especially when building a partial least squares-discriminant analysis (PLS-DA) classification model. The most popular criteria include PLS loading weights (LW) (Wold et al., 2002), variable importance on projection (VIP) scores (Favilla et al., 2013, Wold, Sjöström, 2002), regression coefficients (RC) (Wold et al., 2001a) and target projection (TP) (Kvalheim, 2010, Rajalahti et al., 2009b). PLS loading weights from the fitted PLS model can be used as a measure of variable importance for each component (latent variable), and different components generate different ranking results. VIP summarizes the importance of each variable as reflected by the loading weights of all latent variables of the PLS model. Some researchers have suggested that a variable should be retained if its VIP score is greater than 1 (Chong and Jun, 2005, Gosselin et al., 2010). However, since determining a threshold remains one of the most difficult steps in variable selection for metabolomics data, this criterion needs further verification. RC uses the regression coefficients of the PLS model. It measures only the association between a single variable and the response. Variables with a small absolute regression coefficient can be eliminated as uninformative (Centner, Massart, 1996a). TP provides a projection of RC on the X matrix, so that the target-projected loadings are proportional to the product of RC and the covariance matrix (XᵀX) (Kvalheim and Karstang, 1989, Kvalheim et al., 2009). Selectivity ratio (SR) measures the ratio between explained variance and residual variance of each variable after TP, and it has been used quantitatively to select biomarker candidates (Rajalahti et al., 2009a, Rajalahti, Arneberg, 2009b). In addition, variable ranking can be based on statistical measures relating each variable to the class label, such as correlation (Hall, 1999), information gain (Ben-Bassat, 1982), Euclidean distance (Liang et al., 2008) and mutual information (Yu and Liu, 2004). Methods of this kind provide a measure of importance for a single variable (i.e. a single metabolite) only, without considering the interaction effects among multiple variables.
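As a minimal illustration of univariate ranking, the sketch below scores each variable by the absolute Pearson correlation between its values and a 0/1 class label and sorts the variables accordingly. The function names and the toy data are hypothetical and chosen only for illustration; this is not one of the PLS-based criteria discussed above.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable carries no class information
    return sxy / math.sqrt(sxx * syy)

def rank_variables(X, y):
    """Return (variable index, |r|) pairs sorted from most to least important."""
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        scores.append((j, abs(pearson(column, y))))
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Toy data (assumed): 6 samples x 3 variables; labels 0 = control, 1 = case.
X = [[1.0, 5.0, 0.3],
     [1.2, 4.0, 0.9],
     [0.9, 6.0, 0.1],
     [3.1, 5.0, 0.8],
     [2.9, 4.0, 0.2],
     [3.0, 6.0, 0.5]]
y = [0, 0, 0, 1, 1, 1]
ranking = rank_variables(X, y)
```

Each score depends on one variable only, which is exactly the limitation noted above: two variables that are informative only in combination would both receive low scores.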
2.3.2 Variable subset selection
Ranking variables by their importance does not in itself tell us which variables, or how many, should be discarded. Although variable ranking is simple and time-saving, it is inefficient at identifying the optimal subset of variables. Subset selection seeks an optimal subset of all variables that satisfies given optimality criteria. Any variable ranking method can be turned into a variable subset selection algorithm by introducing a threshold on the variable importance values. Variables with importance values above this threshold are kept, while those below it are eliminated. The choice of this threshold can be subjective or made by a statistical method (Narsky and Porter, 2013). Usually, a trade-off between the prediction accuracy of the model and the number of selected variables is considered. The most straightforward proposal is to use a cross-validation (CV) procedure to determine the threshold: the generalization error is estimated as a function of the number of variables, and the number that minimizes the prediction error is chosen. Once the variables are ranked by some criterion, models are built by adding the sorted variables one by one until all of them are included. The best variable subset has the lowest CV error. For example, if the lowest CV prediction error appears when the fifth variable is added, the first five variables constitute the best variable subset. For subset selection, criteria related to the classification algorithm are employed. The objective function is a pattern classifier, which evaluates variable subsets by their predictive accuracy on test data obtained by statistical re-sampling or CV. Moreover, the optimization algorithm is usually combined with the classification algorithm. In brief, variable subset selection seeks the subset that is optimal or near-optimal with respect to an objective function. For example, genetic algorithm-partial least squares-discriminant analysis (GA-PLS-DA) combines the optimization algorithm GA with a classifier. GA is based on Darwin's classical rules of natural evolution: the best individuals, or the offspring produced by mating the best individuals, have a higher chance of surviving in their environment. This combination has been successfully applied to the classification of normal and pre-cancer tissue samples (Cao et al., 2012). In that work, PLS-DA served as the classifier built on each subset generated by GA. In addition, particle swarm optimization coupled with support vector machine (PSO-SVM) (Alba et al., 2007) was designed to select variable subsets as solutions in order to reduce the high dimensionality of the variables for subsequent classification. The SVM classifier is employed whenever the fitness of a candidate variable subset needs to be evaluated. Compared with variable ranking, subset selection generally achieves better prediction accuracy since it is tuned to the specific interactions between the classifier and the dataset, with re-sampling or cross-validated prediction accuracy serving as a mechanism to avoid overfitting. However, it has to train a classifier for each variable subset, leading to slow execution and high computational cost. Moreover, the solution lacks generality since it is tied to the bias of the classifier used in the fitness evaluation function.
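The CV-based choice of the number of ranked variables can be sketched as follows. This is an illustrative example under stated assumptions: a nearest-centroid classifier stands in for whatever classifier is actually used, leave-one-out CV supplies the error estimate, and the function names and toy data are hypothetical.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Predict the class whose centroid is closest to x (squared Euclidean)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, t in zip(train_X, train_y) if t == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - c) ** 2 for a, c in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_error(X, y, variables):
    """Leave-one-out CV error rate using only the given variable indices."""
    Xs = [[row[j] for j in variables] for row in X]
    wrong = 0
    for i in range(len(Xs)):
        train_X = Xs[:i] + Xs[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, Xs[i]) != y[i]:
            wrong += 1
    return wrong / len(Xs)

def select_by_cv(X, y, ranked):
    """Add ranked variables one by one; keep the shortest prefix with the lowest CV error."""
    errors = [loocv_error(X, y, ranked[:k]) for k in range(1, len(ranked) + 1)]
    best_k = errors.index(min(errors)) + 1
    return ranked[:best_k], errors

# Toy data (assumed): variable 0 separates the classes, variable 1 is large noise.
X = [[0.0, 5.0], [0.2, -5.0], [0.1, 4.0],
     [1.0, -4.0], [1.2, 5.0], [1.1, -5.0]]
y = [0, 0, 0, 1, 1, 1]
selected, errors = select_by_cv(X, y, ranked=[0, 1])
```

Here the CV error is zero with the first ranked variable alone and rises once the noisy second variable is added, so only the first variable is retained.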
2.3.3 Variable selection considering the interaction effect among variables
In fact, finding an optimal subset or ranking of variables is not always satisfactory unless the interaction among multiple variables is considered. The combination effect among variables should be considered since the joint performance of a set of variables may be better than the additive independent contributions of its members (Anastassiou, 2007). To address variable interaction in subset selection effectively and efficiently, Zhao and Liu introduced a variable subset selection method, called INTERACT, which is based on inconsistency and symmetrical uncertainty measurements for finding interacting features (Zhao and Liu, 2009). They proposed that variable interactions could be handled implicitly by a carefully designed variable evaluation metric and a search strategy with a specially designed data structure, which together account for combination effects among variables during variable selection. The method proposed by Breiman (Breiman, 2001) takes the combination effect among numerous variables into account to some extent by means of random forest and a permutation test. Variable importance is assessed by the percent increase in misclassification error when the variable is randomly permuted in the random forest. However, all variables are involved in the random forest model, so it is difficult to obtain a good reflection of the synergetic effect among multiple variables. Recently, Li and Liang proposed a new strategy for variable selection, called model population analysis (MPA) (Li et al., 2010a). It provides a general framework for the development of data-analysis methods. Figure 5 illustrates the outline of the MPA idea.
There are three steps in MPA: (1) Monte Carlo sampling (MCS) is employed to randomly produce N sub-datasets (e.g., 10,000); (2) a sub-model is built on each sub-dataset; and (3) statistical analysis is employed to evaluate an outcome of interest (e.g., prediction errors) for all N established sub-models. With this approach, the variables are identified as informative, uninformative or interfering variables (Wang et al., 2011) based on the differences between the case and control samples. Figure 6 illustrates the prediction error distributions of the three kinds of variables after permutation. The prediction error of informative variables increases after permutation, implying that they significantly improve the prediction performance of the classification model. For the uninformative variables, no significant difference emerges before and after permutation; they behave like noise. For the interfering variables, the prediction errors decrease significantly after permutation, indicating that variables of this kind may have a negative impact on the model and impair the classification. Uninformative and interfering variables are useless, as they may have a bad influence on the modeling. Thus, searching for the optimal variable subset or ranking within the informative variables can yield compelling results.
Insert Figure 5
Insert Figure 6
Subwindow permutation analysis (SPA) (Li et al., 2010b), a variable ranking method, combines the above-mentioned ideas with Monte Carlo sampling (MCS) and MPA. It assesses the importance of each variable based on the sub-models obtained by MCS. First, each sub-dataset, called a sub-window, is generated from the whole data by MCS in both sample space and variable space. Second, a sub-model is built on each sub-dataset and on each permutation of this sub-dataset. Third, for each variable, the prediction errors of the normal and the permuted sub-windows are compared. If a large number of MCS runs are performed, two distributions of prediction errors are obtained for each variable. Finally, informative variables are identified and ranked based on the P value of the Mann-Whitney U test (Mann and Whitney, 1947) on these two distributions. Margin influence analysis (MIA) (Li et al., 2011) is also based on the idea of MCS and MPA. Designed to work with SVM for identifying informative variables, this method likewise gives a measure for each variable according to the difference between the prediction errors obtained with and without this variable. However, the chance of each variable being sampled by MCS is not the same. When some variables are selected more frequently than others, it is not appropriate to assess the importance of each variable using the strategy introduced above. To address this problem, a new sampling method in variable space, called binary matrix sampling (BMS) (Zhang et al., 2012a), was proposed. This method not only considers the synergetic effect among multiple variables, but also guarantees that each variable is selected with equal probability, and it generates a population of different variable combinations. With this population of variable subsets, Yun et al. introduced a method called iteratively retains informative variables (IRIV) (Yun et al., 2014b), which employs the MPA strategy and finds the optimal subset of variables by observing the difference between the prediction errors obtained with and without each variable. Deng et al. developed an optimization algorithm called variable iterative space shrinkage approach (VISSA) to search for the optimal variable combinations (Deng et al., 2014). In VISSA, each variable is assigned a weight according to its importance in modeling. The weight of each variable accumulates through an iterative procedure, and variables are selected when their weights reach "1". Two rules are highlighted in the VISSA algorithm. First, the variable space shrinks smoothly in each step. Second, the variable space is optimized in each step.
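The equal-probability property of binary matrix sampling can be illustrated with a short sketch. It assumes, as our reading of the idea rather than the published implementation, that each column of a 0/1 matrix receives the same number of ones and is then permuted independently, so that every variable appears in the same number of subsets. The function name, the `ratio` parameter and the sizes are hypothetical.

```python
import random

def binary_matrix_sampling(n_subsets, n_vars, ratio=0.5, seed=0):
    """Return an n_subsets x n_vars 0/1 matrix; every column holds the same number of ones."""
    rng = random.Random(seed)
    n_ones = round(n_subsets * ratio)
    columns = []
    for _ in range(n_vars):
        col = [1] * n_ones + [0] * (n_subsets - n_ones)
        rng.shuffle(col)  # permute each column independently
        columns.append(col)
    # Transpose so that each row describes one variable subset.
    return [list(row) for row in zip(*columns)]

M = binary_matrix_sampling(n_subsets=10, n_vars=4)
# Variable indices contained in each sampled subset.
subsets = [[j for j, flag in enumerate(row) if flag] for row in M]
```

A sub-model would then be built on each row's subset, and the prediction errors of sub-models with and without a given variable compared, in the spirit of MPA.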
Although SPA, MIA and IRIV consider the synergetic effect among multiple variables, they rarely investigate the complementary information between variables. Variable complementary network (VCN) is an overall method for visualizing the complementary relationships among biological variables (Li et al., 2012). It accumulates information from several classification models obtained by MCS in variable space, quantitatively computes the complementary information among variables, and then effectively discovers biomarkers with the help of the mutual associations of metabolites. To summarize the methods mentioned above, Table 3 lists several variable selection methods according to whether or not they consider the interaction effects among variables.
Insert Table 3
2.4 Modeling of the data
To explore high-dimensional metabolomics datasets and discover valuable information on biological events, a number of machine learning methods have been developed for modeling. Usually, these methods start with a blind, unsupervised data exploration and continue with supervised analysis in which a priori knowledge of the data structure is utilized. The main characteristics of the machine learning methods described below are summarized in Table 4, which lists the category, advantages and disadvantages of each method, together with some applications in metabolomics.
Insert Table 4
2.4.1 Unsupervised methods
Unsupervised methods are usually used to explore the overall structure of a dataset, finding trends and groupings within it. These methods provide an unbiased view of the data. Several unsupervised methods are available; among them, principal component analysis (PCA), hierarchical cluster analysis (HCA) and self-organizing maps (SOM) are the most frequently used in metabolomics.
PCA is one of the most popular multivariate statistical analysis methods in metabolomics (Pearson, 1901). The purpose of PCA is to obtain a linear transformation of the high-dimensional variables into a small number of factors, called principal components (PCs). This transformation is defined so that the first PC has the largest variance, and each following PC in turn has the largest variance under the constraint of being orthogonal to the preceding PCs. PCA produces two matrices known as scores (i.e. PCs) and loadings. Scores give the new coordinates of the samples. Loadings describe how the original variables are linearly combined into PCs. The distribution of samples can be visualized using a scores plot, which shows the projection of the samples onto the plane spanned by the first and second PCs. PCA is a suitable method for summarizing high-dimensional data. However, it is unable to find the direction or pattern of variables that best separates classes of objects. Yi et al. successfully employed PCA to represent the metabolic footprints of tangerine peels (Yi et al., 2009, Yi et al. ). The idea and main results of volatile metabolic footprinting are shown in Figure 7 (Yi, Dong). In this study, characteristic secondary metabolites such as D-limonene and linalool were screened out on the basis of the metabolic footprints of the tangerine peels. In addition, compounds such as 4-carene, 3-carene, β-pinene and γ-terpinene were screened as major components for the pungent smell of Pericarpium Citri Reticulatae Viride (PCRV). Geranyl acetate, farnesyl acetate and three alcohols (6-hepten-1-ol, 3-methyl-1-hexanol, 1-octanol) were screened as major components for the pleasant odor of Pericarpium Citri Reticulatae (PCR). The results indicate that plant metabolomics analysis focusing on the ripening process is an effective strategy for revealing the chemical features of closely related herbal medicines, such as PCR and PCRV, and is helpful for their quality control.
Insert Figure 7
HCA is another widely used unsupervised method for modeling metabolomics data (Webb, 2003). HCA aims to group relatively similar samples together and to place relatively dissimilar objects in different clusters. The input of HCA is a distance or dissimilarity matrix (e.g. Euclidean, Mahalanobis or Minkowski distances) that represents the dissimilarities among samples. The choice of distance measure has a significant influence on the clustering structure. Clearly, HCA works best only when a hierarchical structure is present in the data. HCA clusters the data into a tree called a dendrogram, which represents the similarities and differences among objects in a two-way structure. When HCA is used for classification, a similarity cut-off must be decided first; it separates the dendrogram into different clusters. HCA cannot tell us why a certain clustering is obtained. That is to say, HCA cannot identify which metabolites are responsible for the differences between clusters. This is the main drawback of HCA.
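The agglomerative procedure and the role of the cut-off can be illustrated with a small sketch. It assumes Euclidean distance and single linkage (the distance between two clusters is that of their closest members); other distances and linkage rules are equally possible, and the function names and toy points are hypothetical.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(points, cutoff):
    """Merge the two closest clusters until their distance exceeds the cut-off."""
    clusters = [[i] for i in range(len(points))]
    merges = []  # (members of cluster A, members of cluster B, merge distance)
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > cutoff:
            break  # equivalent to cutting the dendrogram at this height
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges

# Toy samples (assumed): two tight pairs and one distant sample.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [20.0, 20.0]]
clusters, merges = single_linkage(points, cutoff=2.0)
```

The list of merges, read in order, is the dendrogram. Note that nothing in the output indicates which variables drive the grouping, which is the drawback mentioned above.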
SOM is a neural-network algorithm belonging to the unsupervised-learning category (Kohonen and Maps, 1995). For high-dimensional data, SOM forms a nonlinear projection onto a regular, low-dimensional grid, on which the clustering in the data space and the metric-topological relations of the data items are clearly visible.
2.4.2 Supervised methods
Supervised techniques use the a priori known structure of the data to learn patterns and rules, which are then used to predict new data. A wide range of supervised methods has been employed in metabolomics. The advantage of these methods is that, during modeling, they provide information about the ability of each variable to discriminate between two or more classes; they are therefore widely used in metabolomics for biomarker screening. Supervised techniques can be classified into linear methods, such as partial least squares-discriminant analysis (PLS-DA), linear discriminant analysis (LDA) and orthogonal projections to latent structures discriminant analysis (OPLS-DA), and non-linear methods, such as random forest (RF) and support vector machines (SVM).
LDA tries to find a linear function of the original variables that maximizes the ratio of between-class variance to within-class variance (Webb, 2003). It is fast and powerful, and no parameter optimization is necessary. However, several limitations exist. LDA assumes a common (pooled) covariance matrix for all classes. Therefore, it is not always appropriate if the variance structure differs between classes. Besides, the number of samples needs to be larger than the number of variables so that the inverse of the covariance matrix can be obtained (Bishop, 2006). LDA is particularly suited to data in which the multicollinearity among compounds is not severe. In metabolomics datasets, the number of samples is commonly smaller than the number of variables. In these cases, LDA cannot be applied directly. A possible solution is to reduce the dimension of the variables before LDA. For example, an unsupervised method such as PCA can first be used for variable reduction, and LDA is then applied to the relevant PCs. The number of PCs can be optimized by cross-validation (Stone, 1974, Wold, 1978).
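For reference, the criterion that LDA optimizes can be written out for the two-class case. The notation here is an assumption introduced for illustration: $S_B$ and $S_W$ denote the between-class and pooled within-class scatter matrices, $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ the class mean vectors, and $\mathbf{w}$ the projection direction.

```latex
\[
J(\mathbf{w}) = \frac{\mathbf{w}^{T} S_B \, \mathbf{w}}{\mathbf{w}^{T} S_W \, \mathbf{w}},
\qquad
\mathbf{w}^{*} \propto S_W^{-1} \left( \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 \right)
\]
```

The solution requires $S_W^{-1}$, which does not exist when the number of variables exceeds the number of samples; this is why a dimension reduction step such as PCA is needed before LDA on typical metabolomics data.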
The most widely used supervised method for classification is partial least squares-discriminant analysis (PLS-DA) (Barker and Rayens, 2003), which is a combination of partial least squares (PLS) regression (Wold et al., 2001b) and LDA. Like PCA, this technique is a latent variable extraction approach. The assumption is that the data can be well approximated in a lower-dimensional subspace by latent variables (LVs). These LVs are assumed to be linear combinations of the original variables. Whereas the first PC (PC1) of PCA lies in the direction of the highest variance of the data, the first LV (LV1) of PLS-DA lies in the direction that explains most of the between-class variation of the objects. PLS-DA can deal with highly collinear data, which is a very important advantage of this method, and it is also suitable for spectrometric data. The method is built into most commercial chemometrics software but is still poorly understood by most users (Brereton and Lloyd, 2014). For instance, the projection plot of PLS-DA (scores plot) is very popular for classification in metabolomics because it separates the different classes, although it gives an overoptimistic view of the separation (Westerhuis et al., 2008). We must admit that there are some pitfalls when using PLS-DA to model data with unequal class sizes (Brereton and Lloyd, 2014). However, PLS-DA can provide excellent insights into the cause of discrimination via weights (Hoskuldsson, 2001), loadings, regression coefficients (Centner et al., 1996b), VIP (Wold et al., 1993) and selectivity ratio (SR) (Rajalahti et al., 2009c), and has therefore become a useful tool for biomarker discovery.
A recent modification of PLS-DA is orthogonal projections to latent structures discriminant analysis (OPLS-DA) (Trygg and Wold, 2002). The systematic variation in the data matrix X is split into two parts via the orthogonal signal correction (OSC) technique (Wold et al., 1998): one part is linearly related to the response and the other is orthogonal to it. For OPLS-DA, only the variation related to the response is useful for modeling. It is important to note that OPLS-DA gives prediction results similar to those of PLS-DA (Tapp and Kemsley, 2009). However, OPLS-DA offers better visualization and interpretation since fewer latent variables are required to explain the same variation of the data than with PLS-DA (Verron et al., 2004).
2.4.3 Non-linear methods
The supervised methods introduced above are very well established for investigating linear relationships between variables and class labels. However, these approaches are not suitable for the strongly nonlinear datasets that may arise in intricate biological systems. Because complex interactions occur at different levels of biological organization, biological processes commonly follow a nonlinear response. In these cases, non-linear pattern recognition methods are required to characterize metabolomics data. Many non-linear techniques have been proposed in the pattern recognition and machine learning research fields. Among them, kernel-PLS, random forest (RF) and support vector machines (SVM) are three popular methods used in metabolomics.
Kernel-based models transform the data via specific functions, the kernels. Using the kernel transformation, the original data are mapped into a higher-dimensional feature space, in which the nonlinear problem becomes linear and can be solved easily. Kernel functions are of various types, and users can choose a suitable kernel transformation for a given dataset. The kernel matrix is required to be positive semi-definite, and many kernel functions satisfy this requirement (Shawe-Taylor and Cristianini, 2004). The dot product is the simplest kernel function for the data matrix. The radial basis function is another frequently used kernel function, which requires tuning of a parameter related to the width of the Gaussian. Kernel-based classification methods such as kernel Fisher discriminant analysis (K-FDA) (Cao et al., 2011, Scholkopft and Mullert, 1999), kernel PLS (K-PLS) (Walczak and Massart, 1996) and kernel OPLS (KO-PLS) (Bylesjo et al., 2008) have been developed, and they all exhibit clear advantages in solving nonlinear problems.
Support vector machine (SVM) is a powerful kernel-based classifier that makes use of a set of objects, named support vectors, to define decision boundaries separating different classes (Vapnik, 1998). SVM focuses on finding a hyperplane that splits two classes while maximizing the thickness of the margin, i.e. the distance from the plane to the closest data points of each class (Li et al., 2009b, Luts et al., 2010). If a point lies on the wrong side of the margin, the margin is maximized while penalizing that point. This step makes it possible to split overlapping classes. Support vectors are the points that lie on the boundary or on the wrong side of the margin. When the classes are separated by a non-linear boundary, the kernel method is used to find the boundary. SVM is particularly suitable for data of small sample size, and it can handle both linear and nonlinear classification problems by applying linear and non-linear kernels. The major disadvantage of SVM is that it does not provide a universal way of solving non-linear problems. Hence, the kernel function should be selected carefully (Burges, 1998).
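The two kernel functions mentioned above can be written down in a few lines. This is a minimal sketch: the function names and the parameter `gamma`, which controls the width of the Gaussian, are naming assumptions made for illustration and are not tied to any particular software package.

```python
import math

def linear_kernel(a, b):
    """Dot product, the simplest kernel."""
    return sum(x * y for x, y in zip(a, b))

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (radial basis function) kernel; gamma controls the width."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

def gram_matrix(X, kernel):
    """Kernel matrix K with K[i][j] = kernel(X[i], X[j])."""
    return [[kernel(xi, xj) for xj in X] for xi in X]

# Toy samples (assumed): three samples described by two variables.
X = [[1.0, 2.0], [3.0, 4.0], [1.0, 2.5]]
K_lin = gram_matrix(X, linear_kernel)
K_rbf = gram_matrix(X, rbf_kernel)
```

Kernel-based classifiers work with the n x n kernel matrix rather than the original n x p data, which is one reason they remain practical when the number of variables far exceeds the number of samples.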
Random forest (RF) (Breiman, 2001) is an ensemble learning method consisting of a large number of classification trees (Breiman et al., 1984). It is one of the most powerful classifiers for high-dimensional data (Scott et al., 2013). A bootstrap method is used to select samples with replacement from the original samples (so-called bootstrap samples) for training the classification trees. All trees in the forest are grown to maximum size, without pruning. Two machine learning techniques, bagging and random feature selection, are employed in RF. Both are powerful and efficient. In bagging, each tree is trained on a bootstrap sample of the training dataset, and predictions are obtained from the majority vote of the trees. When RF constructs an individual tree, not all training samples are used, so a set of out-of-bag (OOB) samples exists, which can be used to obtain a validated classification accuracy. In RF, variable importance is measured by randomly permuting the variable while keeping all other variables fixed and computing the loss of classification accuracy on the OOB samples. It is defined as the average accuracy loss over all trees and all samples in the forest. Because RF performs variable selection simultaneously during classification, it is suitable for high-dimensional data in which irrelevant variables exist. RF has been reported to perform much better than many classifiers such as PLS-DA and OPLS-DA under external validation. Unfortunately, it has not drawn enough attention in metabolomics (Scott, Lin, 2013).
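The bootstrap and out-of-bag mechanics can be illustrated without growing any trees. The sketch below, with hypothetical function names and sizes, draws one bootstrap sample per tree, collects the samples that were never drawn as the OOB set, and provides a helper that permutes a single variable as done for the importance measure. The tree construction and the accuracy computation are deliberately left out.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; the never-drawn indices are out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    oob = [i for i in range(n_samples) if i not in drawn]
    return in_bag, oob

def permute_column(X, j, rng):
    """Copy X with variable j shuffled across samples; other variables are untouched."""
    shuffled = [row[j] for row in X]
    rng.shuffle(shuffled)
    return [row[:j] + [v] + row[j + 1:] for row, v in zip(X, shuffled)]

rng = random.Random(1)
n = 200
oob_fractions = []
for _ in range(50):  # one bootstrap sample per tree
    in_bag, oob = bootstrap_split(n, rng)
    oob_fractions.append(len(oob) / n)
mean_oob = sum(oob_fractions) / len(oob_fractions)

X_demo = [[1, 2], [3, 4], [5, 6]]
X_perm = permute_column(X_demo, 0, rng)
```

On average roughly 37% of the samples are out-of-bag for each tree, so every tree has its own validation set. The importance of variable j would then be the drop in OOB accuracy when the tree predicts from the permuted OOB data instead of the original.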
2.4.4 Model tuning and model validation
The tuning of parameters is of great importance when building a model. For example, one of the crucial steps in PCA and PLS is to optimize the number of components. Many methods can be used for model tuning, including Mallows' Cp statistic (Mallows, 1973), the Akaike information criterion (AIC) (Akaike, 1974), the Bayesian information criterion (BIC) (Schwarz, 1978) and cross-validation (CV) (Stone, 1974, Wold, 1978). Among them, CV is the most commonly used because it chooses a model from the point of view of prediction ability. The simplest CV method is leave-one-out CV (LOOCV), in which each sample in turn is left out for prediction while the others are used for training. However, it is reported that LOOCV tends to select overly large models if only the prediction error is used (Shao, 1993). One solution is to partition the samples differently, as in K-fold CV (Geisser, 1975) and Monte Carlo CV (MCCV) (Shao, 1993).
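The three partition schemes differ only in how the training and test indices are generated, as the following sketch shows. The function names, seeds and sizes are hypothetical choices made for illustration.

```python
import random

def loocv_splits(n):
    """Leave-one-out: each sample is the test set exactly once."""
    return [([j for j in range(n) if j != i], [i]) for i in range(n)]

def kfold_splits(n, k, seed=0):
    """K-fold: shuffle once, then cut into k nearly equal test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    return [([i for i in idx if i not in fold], fold) for fold in folds]

def mccv_splits(n, n_test, n_repeats, seed=0):
    """Monte Carlo CV: repeated random train/test partitions."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        test = rng.sample(range(n), n_test)
        train = [i for i in range(n) if i not in test]
        splits.append((train, test))
    return splits

loo = loocv_splits(5)
kfold = kfold_splits(10, 3)
mccv = mccv_splits(10, n_test=3, n_repeats=4)
```

In LOOCV and K-fold CV every sample is predicted exactly once, whereas in MCCV the test sets of different repeats may overlap and the number of repeats is chosen freely.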
Model validation is the process of deciding whether the results quantify the hypothesized relationships between the variables and the responses and provide an accurate estimate of the prediction ability of the model. Supervised machine learning methods such as PLS-DA have a high tendency to over-fit, especially on high-dimensional data (Brereton, 2006, Li et al., 2010c). Careful model validation is therefore required. Figure 8 shows an example of CV for model validation. Four data sets with random values were simulated. Each data set has 100 samples, and the number of variables is set to 5, 50, 500 and 5000, respectively. For each sample, the class label is randomly assigned. PLS-DA is used to classify the samples. In the PLS scores plots (Fig. 8 (A)), the two groups can be separated, and the separation clearly improves as the number of variables increases, indicating the presence of over-fitting. In the PLS CV plots (Fig. 8 (B)), in contrast, the two groups cannot be separated, suggesting that the model has no predictive ability and should not be used for prediction.
Insert Figure 8
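The same effect can be reproduced on a small scale. The sketch below is not the simulation behind Figure 8: it assumes a nearest-centroid classifier as a simple stand-in for PLS-DA, 20 samples instead of 100, and 1000 random variables, and all names are hypothetical. It compares the accuracy on the training samples with the leave-one-out CV accuracy when the class labels are unrelated to the data.

```python
import random

def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def predict(X, y, x):
    """Nearest-centroid prediction (a simple stand-in for PLS-DA)."""
    dists = {}
    for label in set(y):
        c = centroid([r for r, t in zip(X, y) if t == label])
        dists[label] = sum((a - b) ** 2 for a, b in zip(x, c))
    return min(dists, key=dists.get)

def training_accuracy(X, y):
    """Accuracy when each sample is predicted by a model that has already seen it."""
    return sum(predict(X, y, x) == t for x, t in zip(X, y)) / len(y)

def loocv_accuracy(X, y):
    """Accuracy when each sample is predicted by a model built without it."""
    hits = 0
    for i in range(len(y)):
        hits += predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i]) == y[i]
    return hits / len(y)

rng = random.Random(0)
n_samples, n_vars = 20, 1000
X = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_samples)]
y = [0] * 10 + [1] * 10  # labels unrelated to X by construction
train_acc = training_accuracy(X, y)
cv_acc = loocv_accuracy(X, y)
```

With many more variables than samples, the training accuracy is expected to be close to 1 even though the data are pure noise, while the cross-validated accuracy stays near or below chance level.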
Several criteria can be used to assess the prediction ability of a model, such as sensitivity, specificity, accuracy, the receiver operating characteristic (ROC) curve and Q2. Sensitivity, i.e. the true positive rate, is the proportion of the actual positive samples that are correctly identified as positive. Specificity, i.e. the true negative rate, is the proportion of the actual negative samples that are correctly identified as negative. The accuracy of a classification model is the proportion of objects that are correctly classified. A ROC curve is the plot of sensitivity versus 1-specificity at different classification boundaries. For a perfect classification, sensitivity should be close to 1 and 1-specificity should be close to 0. The ROC curve thus describes the sensitivity and specificity of a classification model at different classification boundaries. The area under this curve (AUC) is used to quantify the performance of the method: the closer the AUC is to 1, the better the method performs.
MA NU
The prediction error measurement Q2 measures how well these class labels could be predicted for the new data. It is defined as follows:
$$Q^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
where $\hat{y}_i$ denotes the predicted value of sample i, $y_i$ its observed value, and $\bar{y}$ the mean value of y for all samples. If all the samples are predicted very well, Q2 is close to 1.
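The criteria defined above can be written out directly. The following is a small sketch using NumPy only; the function names are illustrative, labels are assumed to be coded 0/1 with 1 as the positive class, and both classes are assumed to be present (otherwise the ratios are undefined).

```python
import numpy as np


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded 0/1."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    return tp, tn, fp, fn


def sensitivity(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)  # true positive rate


def specificity(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return tn / (tn + fp)  # true negative rate


def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def roc_auc(y_true, scores):
    """Sweep the classification boundary over the scores and integrate the ROC curve."""
    scores = np.asarray(scores, dtype=float)
    fpr, tpr = [0.0], [0.0]
    for t in np.sort(np.unique(scores))[::-1]:
        tp, tn, fp, fn = confusion_counts(y_true, scores >= t)
        tpr.append(tp / (tp + fn))      # sensitivity at this boundary
        fpr.append(fp / (fp + tn))      # 1 - specificity at this boundary
    fpr, tpr = np.array(fpr), np.array(tpr)
    # Trapezoidal rule for the area under the curve.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


def q2(y, y_hat):
    """Q2 = 1 - sum((y_i - yhat_i)^2) / sum((y_i - mean(y))^2)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2))
```

For Q2 to measure prediction ability, `y_hat` should hold predictions for samples that were left out when the model was fitted, for example CV predictions, not fitted values.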
In CV, model tuning and model validation are carried out simultaneously: once the optimal model parameter has been determined, the measures of prediction ability, such as Q2, are taken from the same CV run that was used to tune the parameters. However, this strategy may provide over-optimistic estimates of the model's prediction ability (Krstajic et al., 2014). A more systematic way is to use double CV (DCV) (Filzmoser et al., 2009, Stone, 1974). It consists of two loops: the inner loop is used for model tuning and the outer loop for model validation. The samples used for prediction in the outer loop do not participate in model tuning. DCV has been shown to give a more accurate estimate of the error rate than 6-fold CV (Westerhuis, Hoefsloot, 2008).
The ideal situation for model validation is to use an independent test set, which is assumed to be representative and independent of the training data. There are a number of algorithms for dividing samples into a training set and a test set, including the Duplex algorithm (Snee, 1977), the Kennard-and-Stone (KS) algorithm (Kennard and Stone, 1969) and the SPXY algorithm (Galvao et al., 2005). However, this ideal is rarely met in practice, and the results obtained on the test set may therefore be biased.
Permutation testing is another way of validating a model. The classification ability of the established model is compared with that of models built on randomly assigned class labels, to evaluate whether the former is significantly better than the latter (Golland et al., 2005). In a permutation test, the class labels of the samples are permuted. The rationale is that a model built with wrong class labels cannot predict the classes well. By repeating the permutation a large number of times, a group of "wrong" models is built and the distributions of accuracy, Q2 and AUC are obtained. For a valid model, the difference between the "right" model and the "wrong" models should be significant, which can be assessed by statistical hypothesis testing. Permutation tests have many applications in metabolomics studies (Blaise et al., 2013, Huang et al., 2013).
Modeling of metabolomics data is a systematic task. For exploratory studies, an unsupervised method such as PCA provides an informative first look at the structure of the dataset and the relationships between groups. Supervised methods such as PLS-DA and OPLS-DA are then applied to classify the samples and to discover biomarkers. When these classifiers perform poorly, non-linear models such as SVM and RF are applied to explore non-linear relationships within the data. In addition, the parameters of each model should be well tuned, and the model should be validated with caution to ensure its prediction ability for future samples.
2.5. One eye on the future
Compared to animals, plants have an extremely wide diversity of metabolites. It is estimated that more than 200,000 metabolites are present in the plant kingdom (Oksman-Caldentey and Saito, 2005). So far, numerous authors have demonstrated that data analysis based on a single dataset is limited in its ability to capture the chemical complexity of the plant metabolome. A large amount of data and information is being generated by numerous experimental platforms (e.g. NMR, GC-MS or LC-MS). Consequently, combining information becomes more and more necessary to extend metabolite coverage and to characterize a biological system (Boccard and Rudaz, 2014). The greatest future challenge is how to efficiently integrate the mass of information from various sources, i.e. the data fusion problem. Merging information from multiple datasets with different structural characteristics, and extracting their common or distinctive features, will undoubtedly form a crucial element of a more comprehensive view of plant metabolomics. A growing number of papers discussing the problem of data fusion have been published since 2005 (Boccard and Rudaz, 2014, Smilde et al., 2005). A further level of fusion is integration with other “Omics” fields, such as genomics, transcriptomics and proteomics. They are all effective strategies for describing a whole biological system. However, care should be taken to avoid network discordance when metabolomics is integrated with other “Omics” fields (Fernie and Stitt, 2012).
3. Conclusions
In summary, metabolomics, as a fundamental biotechnology, plays an essential role in basic research for elucidating environmental effects and gene functions and for defining cellular processes. Caution is still needed in data acquisition, data processing and the interpretation of information, owing to the numerous limitations of data analysis in plant metabolomics. We emphasize here four questions that are of great importance to the advance of data analysis in plant metabolomics. 1) Automatic and effective data preprocessing: this remains hard to achieve, especially the detection, alignment and deconvolution of peaks with low responses. 2) The NP-hard problem in variable selection: addressing it is an attractive but intractable task for researchers. 3) Confident identification of an unknown metabolite from complex MS spectral data remains a great challenge. 4) Model validation: new and efficient model validation methods and indexes are urgently needed, and they should be carefully selected in practice to guarantee that the models are fully validated and have good prediction ability for future real samples. All these questions, together with the high dimensionality of metabolomics datasets, pose many fundamental questions for chemometrics, which faces enormous challenges in developing robust and efficient methods to answer the various biological questions arising from metabolomics. We hope that this review can serve as a guide for practitioners of plant metabolomics and provide insights into the present use and applications of data analysis.
Conflicts of interest statement
The authors declare no conflicts of interest.
Acknowledgements
This work was supported financially by National Nature Foundation Committee of P.R. China (No.21465016, No.21105129, No.91127024, and No.21473257), Science and Technological Program for Dongguan's Higher Education, Science and Research, and Health Care Institutions (2012108102032).
References
Akaike H. A new look at the statistical model identification. Automatic Control, IEEE Transactions on. 1974;19:716-23.
Alba E, Garcia-Nieto J, Jourdan L, Talbi E. Gene selection in cancer classification using PSO/SVM and GA/SVM hybrid algorithms. Evolutionary Computation, 2007 CEC 2007 IEEE Congress on 2007. p. 284-90.
Allen F, Greiner R, Wishart D. Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics. 2014a:1-13.
Allen F, Pon A, Wilson M, Greiner R, Wishart D. CFM-ID: a web server for annotation, spectrum prediction and metabolite identification from tandem mass spectra. Nucleic Acids Res. 2014b;42:W94-9.
Allwood JW, Ellis DI, Goodacre R. Metabolomic technologies and their application to the study of plants and plant–host interactions. Physiologia Plantarum. 2008;132:117-35.
Allwood JW, Goodacre R. An introduction to liquid chromatography–mass spectrometry instrumentation applied in plant metabolomic analyses. Phytochemical analysis. 2010;21:33-47.
Allwood JW, Parker D, Beckmann M, Draper J, Goodacre R. Fourier transform ion cyclotron resonance mass spectrometry for plant metabolite profiling and metabolite identification. Plant Metabolomics: Springer; 2012. p. 157-76.
Anastassiou D. Computational analysis of the synergy among multiple interacting genes. Molecular Systems Biology. 2007;3:n/a-n/a.
Andreev VP, Rejtar T, Chen H-S, Moskovets EV, Ivanov AR, Karger BL. A universal denoising and peak picking algorithm for LC-MS based on matched filtration in the chromatographic time domain. Analytical chemistry. 2003;75:6314-26.
Araújo MCU, Saldanha TCB, Galvão RKH, Yoneyama T, Chame HC, Visani V. The successive projections algorithm for variable selection in spectroscopic multicomponent analysis. Chemometr Intell Lab. 2001;57:65-73.
BaniMustafa AH, Hardy NW. A Strategy for Selecting Data Mining Techniques in Metabolomics. Plant Metabolomics: Springer; 2012. p. 317-33.
Baran R, Kochi H, Saito N, Suematsu M, Soga T, Nishioka T, et al. MathDAMP: a package for differential analysis of metabolite profiles. BMC bioinformatics. 2006;7:530.
Barker M, Rayens W. Partial least squares for discrimination. J Chemometr. 2003;17:166-73.
Bellew M, Coram M, Fitzgibbon M, Igra M, Randolph T, Wang P, et al. A suite of algorithms for the comprehensive analysis of complex protein mixtures using high-resolution LC-MS. Bioinformatics. 2006;22:1902-9.
Ben-Bassat M. Pattern recognition and reduction of dimensionality. Handbook of Statistics. 1982;2:773-910.
Benecke C, Grund R, Hohberger R, Kerber A, Laue R, Wieland T. Molgen(+), a Generator of Connectivity Isomers and Stereoisomers for Molecular-Structure Elucidation. Analytica Chimica Acta. 1995;314:141-7.
Benton H, Wong D, Trauger S, Siuzdak G. XCMS2: processing tandem mass spectrometry data for metabolite identification and structural characterization. Analytical chemistry. 2008;80:6382-9.
Bertini I, Luchinat C, Miniati M, Monti S, Tenori L. Phenotyping COPD by 1H NMR metabolomics of exhaled breath condensate. Metabolomics. 2014;10:302-11.
Bertrand S, Bohni N, Schnee S, Schumpp O, Gindro K, Wolfender J-L. Metabolite induction via microorganism co-culture: A potential way to enhance chemical diversity for drug discovery. Biotechnology Advances. 2014;32:1180-204.
Biais B, Bernillon S, Deborde C, Cabasson C, Rolin D, Tadmor Y, et al. Precautions for harvest, sampling, storage, and transport of crop plant metabolomics samples. Plant Metabolomics: Springer; 2012. p. 51-63.
Bishop CM. Pattern recognition and machine learning: springer New York; 2006.
Blaise BJ, Gouel-Cheron A, Floccard B, Monneret G, Allaouchiche B. Metabolic Phenotyping of Traumatized Patients Reveals a Susceptibility to Sepsis. Analytical Chemistry. 2013;85:10850-5.
Blanchet FG, Legendre P, Borcard D. Forward selection of explanatory variables. Ecology. 2008;89:2623-32.
Boccard J, Rudaz S. Harnessing the complexity of metabolomic data with chemometrics. Journal of Chemometrics. 2014;28:1-9.
Boccard J, Veuthey JL, Rudaz S. Knowledge discovery in metabolomics: an overview of MS data handling. Journal of separation science. 2010;33:290-304.
Bocker S, Letzel MC, Liptak Z, Pervukhin A. SIRIUS: decomposing isotope patterns for metabolite identification. Bioinformatics. 2009;25:218-24.
Bocker S, Rasche F. Towards de novo identification of metabolites by analyzing tandem mass spectra. Bioinformatics. 2008;24:i49-i55.
Boelens HF, Dijkstra RJ, Eilers PH, Fitzpatrick F, Westerhuis JA. New background correction method for liquid chromatography with diode array detection, infrared spectroscopic detection and Raman spectroscopic detection. Journal of Chromatography A. 2004;1057:21-30.
Bonn B, Leandersson C, Fontaine F, Zamora I. Enhanced metabolite identification with MS(E) and a semi-automated software for structural elucidation. Rapid Commun Mass Spectrom. 2010;24:3127-38.
Breiman L. Random forests. Mach Learn. 2001;45:5-32.
Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees: CRC press; 1984.
Breitling R, Ritchie S, Goodenowe D, Stewart ML, Barrett MP. Ab initio prediction of metabolic networks using Fourier transform mass spectrometry data. Metabolomics. 2006;2:155-64.
Brereton RG. Consequences of sample size, variable selection, and model validation and optimisation, for predicting classification ability from analytical data. Trac-Trend Anal Chem. 2006;25:1103-11.
Brereton RG, Lloyd GR. Partial least squares discriminant analysis: taking the magic away. J Chemometr. 2014;28:213-25.
Brown M, Dunn WB, Dobson P, Patel Y, Winder CL, Francis-McIntyre S, et al. Mass spectrometry tools and metabolite-specific databases for molecular identification in metabolomics. Analyst. 2009;134:1322-32.
Brown M, Wedge DC, Goodacre R, Kell DB, Baker PN, Kenny LC, et al. Automated workflows for accurate mass-based putative metabolite identification in LC/MS-derived metabolomic datasets. Bioinformatics. 2011;27:1108-12.
Burges CJ. A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery. 1998;2:121-67.
Bylesjo M, Rantalainen M, Nicholson J, Holmes E, Trygg J. K-OPLS package: Kernel-based orthogonal projections to latent structures for prediction and interpretation in feature space. Bmc Bioinformatics. 2008;9:106.
Bylund D, Danielsson R, Malmquist G, Markides KE. Chromatographic alignment by warping and dynamic programming as a pre-processing tool for PARAFAC modelling of liquid chromatography–mass spectrometry data. Journal of Chromatography A. 2002;961:237-44.
Cai W, Li Y, Shao X. A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra. Chemometr Intell Lab. 2008;90:188-94.
Cao DS, Zeng MM, Yi LZ, Wang B, Xu QS, Hu QN, et al. A novel kernel Fisher discriminant analysis: constructing informative kernel by decision tree ensemble for metabolomics data analysis. Anal Chim Acta. 2011;706:97-104.
Cao MD, Sitter B, Bathen TF, Bofin A, Lønning PE, Lundgren S, et al. Predicting long-term survival and treatment response in breast cancer patients receiving neoadjuvant chemotherapy by MR metabolic profiling. NMR in Biomedicine. 2012;25:369-78.
Castillo S, Gopalacharyulu P, Yetukuri L, Orešič M. Algorithms and tools for the preprocessing of LC–MS metabolomics data. Chemometrics and Intelligent Laboratory Systems. 2011;108:23-32.
Centner V, Massart D-L, de Noord OE, de Jong S, Vandeginste BM, Sterna C. Elimination of Uninformative Variables for Multivariate Calibration. Anal Chem. 1996a;68:3851-8.
Centner V, Massart DL, de Noord OE, de Jong S, Vandeginste BM, Sterna C. Elimination of uninformative variables for multivariate calibration. Anal Chem. 1996b;68:3851-8.
Chan ECY, Koh PK, Mal M, Cheah PY, Eu KW, Backshall A, et al. Metabolic Profiling of Human Colorectal Cancer Using High-Resolution Magic Angle Spinning Nuclear Magnetic Resonance (HR-MAS NMR) Spectroscopy and Gas Chromatography Mass Spectrometry (GC/MS). J Proteome Res. 2009;8:352-61.
Chen T, Martin E. Bayesian linear regression and variable selection for spectroscopic calibration. Anal Chim Acta. 2009;631:13-21.
Chong IG, Jun CH. Performance of some variable selection methods when multicollinearity is present. Chemometrics and Intelligent Laboratory Systems. 2005;78:103-12.
Creek D, Dunn W, Fiehn O, Griffin J, Hall R, Lei Z, et al. Metabolite identification: are you sure? And how do your peers gauge your confidence? Metabolomics. 2014;10:350-3.
Creek DJ, Jankevics A, Burgess KE, Breitling R, Barrett MP. IDEOM: an Excel interface for analysis of LC-MS-based metabolomics data. Bioinformatics. 2012;28:1048-9.
Cusido RM, Onrubia M, Sabater-Jara AB, Moyano E, Bonfill M, Goossens A, et al. A rational approach to improving the biotechnological production of taxanes in plant cell cultures of Taxus spp. Biotechnology Advances. 2014;32:1157-67.
Damen H, Henneberg D, Weimann B. Siscom — a new library search system for mass spectra. Analytica Chimica Acta. 1978;103:289-302.
Danielsson R, Bylund D, Markides KE. Matched filtering with background suppression for improved quality of base peak chromatograms and mass spectra in liquid chromatography–mass spectrometry. Analytica Chimica Acta. 2002;454:167-84.
Davey MR, Anthony P, Power JB, Lowe KC. Plant protoplasts: status and biotechnological perspectives. Biotechnology Advances. 2005;23:131-71.
De Souza DP, Saunders EC, McConville MJ, Likić VA. Progressive peak clustering in GC-MS Metabolomic experiments applied to Leishmania parasites. Bioinformatics. 2006;22:1391-6.
De Vos RC, Moco S, Lommen A, Keurentjes JJ, Bino RJ, Hall RD. Untargeted large-scale plant metabolomics using liquid chromatography coupled to mass spectrometry. Nature protocols. 2007;2:778-91.
Deborde C, Erban A, Kopka J, Goodacre R, Hall RD. Plant metabolomics and its potential for systems biology research: Background concepts, technology, and methodology. Methods in Systems Biology. 2011;500:299.
Deng B-c, Yun Y-h, Liang Y-z, Yi L-z. A novel variable selection approach that iteratively optimizes variable space using weighted binary matrix sampling. Analyst. 2014;139:4836-45.
Doerfler H, Sun X, Wang L, Engelmeier D, Lyon D, Weckwerth W. mzGroupAnalyzer--predicting pathways and novel chemical structures from untargeted high-throughput metabolomics data. PLoS One. 2014;9:e96188.
Dong H, Zhang AH, Sun H, Wang HY, Lu X, Wang M, et al. Ingenuity pathways analysis of urine metabolomics phenotypes toxicity of Chuanwu in Wistar rats by UPLC-Q-TOF-HDMS coupled with pattern recognition methods. Mol Biosyst. 2012;8:1206-21.
Draisma HHM, Reijmers TH, Meulman JJ, van der Greef J, Hankemeier T, Boomsma DI. Hierarchical clustering analysis of blood plasma lipidomics profiles from mono- and dizygotic twin families. Eur J Hum Genet. 2013;21:95-101.
Draper J, Enot DP, Parker D, Beckmann M, Snowdon S, Lin W, et al. Metabolite signal identification in accurate mass metabolomics data with MZedDB, an interactive m/z annotation tool utilising predicted ionisation behaviour 'rules'. BMC Bioinformatics. 2009;10:227.
Du P, Kibbe WA, Lin SM. Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching. Bioinformatics. 2006;22:2059-65.
Dunn WB, Broadhurst D, Begley P, Zelena E, Francis-McIntyre S, Anderson N, et al. Procedures for large-scale metabolic profiling of serum and plasma using gas chromatography and liquid chromatography coupled to mass spectrometry. Nat Protoc. 2011;6:1060-83.
Duran AL, Yang J, Wang L, Sumner LW. Metabolomics spectral formatting, alignment and conversion tools (MSFACTs). Bioinformatics. 2003;19:2283-93.
Egertson JD, Eng JK, Bereman MS, Hsieh EJ, Merrihew GE, MacCoss MJ. De novo correction of mass measurement error in low resolution tandem MS spectra for shotgun proteomics. J Am Soc Mass Spectrom. 2012;23:2075-82.
Eilers PH. Parametric time warping. Analytical chemistry. 2004;76:404-11.
Eng JK, Fischer B, Grossmann J, MacCoss MJ. A Fast SEQUEST Cross Correlation Algorithm. J Proteome Res. 2008;7:4598-602.
Ernst M, Silva DB, Silva RR, Vêncio RZ, Lopes NP. Mass spectrometry in plant metabolomics strategies: from analytical platforms to data acquisition and processing. Natural product reports. 2014.
Erve JCL, Gu M, Wang YD, DeMaio W, Talaat RE. Spectral Accuracy of Molecular Ions in an LTQ/Orbitrap Mass Spectrometer and Implications for Elemental Composition Determination. J Am Soc Mass Spectr. 2009;20:2058-69.
Fan Y, Murphy TB, Byrne JC, Brennan L, Fitzpatrick JM, Watson RWG. Applying Random Forests To Identify Biomarker Panels in Serum 2D-DIGE Data for the Detection and Staging of Prostate Cancer. J Proteome Res. 2010;10:1361-73.
Favilla S, Durante C, Vigni ML, Cocchi M. Assessing feature relevance in NPLS models by VIP. Chemom Intell Lab Syst. 2013;129:76-86.
Felinger A. Data analysis and signal processing in chromatography: Elsevier; 1998.
Fenn JB, Mann M, Meng CK, Wong SF, Whitehouse CM. Electrospray Ionization for Mass Spectrometry of Large Biomolecules. Science. 1989;246:64-71.
Fernandez-Albert F, Llorach R, Andres-Lacueva C, Perera A. An R package to analyse LC/MS metabolomic data: MAIT (Metabolite Automatic Identification Toolkit). Bioinformatics. 2014;30:1937-9.
Fernie AR, Stitt M. On the discordance of metabolomics with proteomics and transcriptomics: coping with increasing complexity in logic, chemistry, and network interactions scientific correspondence. Plant Physiology. 2012;158:1139-45.
Fiehn O, Kopka J, Trethewey RN, Willmitzer L. Identification of uncommon plant metabolites based on calculation of elemental compositions using gas chromatography and quadrupole mass spectrometry. Anal Chem. 2000;72:3573-80.
Fiehn O, Spranger J. Use of Metabolomics to Discover Metabolic Patterns Associated with Human Diseases. In: Harrigan G, Goodacre R, editors. Metabolic Profiling: Its Role in Biomarker Discovery and Gene Function Analysis: Springer US; 2003. p. 199-215.
Field D, Sansone SA. A special issue on data standards. Omics-a Journal of Integrative Biology. 2006;10:84-93.
Filzmoser P, Liebmann B, Varmuza K. Repeated double cross validation. J Chemometr. 2009;23:160-71.
Forina M, Casolino C, Pizarro Millan C. Iterative predictor weighting (IPW) PLS: a technique for the elimination of useless predictors in regression problems. J Chemometr. 1999;13:165-84.
Galvao RKH, Araujo MCU, Jose GE, Pontes MJC, Silva EC, Saldanha TCB. A method for calibration and validation subset partitioning. Talanta. 2005;67:736-40.
Gan F, Ruan G, Mo J. Baseline correction by improved iterative polynomial fitting with automatic threshold. Chemometrics and Intelligent Laboratory Systems. 2006;82:59-65.
Geisser S. The predictive sample reuse method with applications. J Am Stat Assoc. 1975;70:320-8.
Genga A, Mattana M, Coraggio I, Locatelli F, Piffanelli P, Consonni R. Plant Metabolomics: A characterisation of plant responses to abiotic stresses. 2011.
Gerlich M, Neumann S. MetFusion: integration of compound identification strategies. Journal of Mass Spectrometry. 2013;48:291-8.
Gika HG, Macpherson E, Theodoridis GA, Wilson ID. Evaluation of the repeatability of ultra-performance liquid chromatography–TOF-MS for global metabolic profiling of human urine samples. Journal of Chromatography B. 2008a;871:299-305.
Gika HG, Theodoridis G, Extance J, Edge AM, Wilson ID. High temperature-ultra performance liquid chromatography–mass spectrometry for the metabonomic analysis of Zucker rat urine. Journal of Chromatography B. 2008b;871:279-87.
Gipson GT, Tatsuoka KS, Sokhansanj BA, Ball RJ, Connor SC. Assignment of MS-based metabolomic datasets via compound interaction pair mapping. Metabolomics. 2008;4:94-103.
Golland P, Liang F, Mukherjee S, Panchenko D. Permutation tests for classification. Learning Theory: Springer; 2005. p. 501-15.
Goodacre R. Making sense of the metabolome using evolutionary computation: seeing the wood with the trees. Journal of experimental botany. 2005;56:245-54.
Goodacre R, Vaidyanathan S, Dunn WB, Harrigan GG, Kell DB. Metabolomics by numbers: acquiring and understanding global metabolite data. Trends in biotechnology. 2004;22:245-52.
Gosselin R, Rodrigue D, Duchesne C. A Bootstrap-VIP approach for selecting wavelength intervals in spectral imaging applications. Chemometrics and Intelligent Laboratory Systems. 2010;100:12-21.
H Martens TN. Multivariate Calibration. New York: Wiley; 1989.
Haas W, Faherty BK, Gerber SA, Elias JE, Beausoleil SA, Bakalarski CE, et al. Optimization and use of peptide mass measurement accuracy in shotgun proteomics. Molecular & Cellular Proteomics. 2006;5:1326-37.
Haimi P, Uphoff A, Hermansson M, Somerharju P. Software tools for analysis of mass spectrometric lipidome data. Analytical chemistry. 2006;78:8324-31.
Halket JM, Waterman D, Przyborowska AM, Patel RKP, Fraser PD, Bramley PM. Chemical derivatization and mass spectral libraries in metabolic profiling by GC/MS and LC/MS/MS. Journal of Experimental Botany. 2005;56:219-43.
Hall MA. Correlation-based feature selection for machine learning: The University of Waikato; 1999.
Hall RD. Plant metabolomics: from holistic hope, to hype, to hot topic. New Phytologist. 2006;169:453-68.
Hall RD. Annual Plant Reviews, Biology of Plant Metabolomics: John Wiley & Sons; 2011.
Hantao LW, Aleme HG, Pedroso MP, Sabin GP, Poppi RJ, Augusto F. Multivariate curve resolution combined with gas chromatography to enhance analytical separation in complex samples: A review. Analytica Chimica Acta. 2012;731:11-23.
Hastings CA, Norton SM, Roy S. New algorithms for processing and peak detection in liquid chromatography/mass spectrometry data. Rapid communications in mass spectrometry. 2002;16:462-7.
Heinonen M, Rantanen A, Mielikainen T, Kokkonen J, Kiuru J, Ketola RA, et al. FiD: a software for ab initio structural identification of product ions from tandem mass spectrometric data. Rapid Commun Mass Spectrom. 2008;22:3043-52.
Heinonen M, Shen H, Zamboni N, Rousu J. Metabolite identification and molecular fingerprint prediction through machine learning. Bioinformatics. 2012;28:2333-41.
Hilario M, Kalousis A, Pellegrini C, Mueller M. Processing and classification of protein mass spectra. Mass spectrometry reviews. 2006;25:409-49.
Hill AW, Mortishire-Smith RJ. Automated assignment of high-resolution collisionally activated dissociation mass spectra using a systematic bond disconnection approach. Rapid Commun Mass Sp. 2005;19:3111-8.
Hiller K, Hangebrauk J, Jäger C, Spura J, Schreiber K, Schomburg D. MetaboliteDetector: comprehensive analysis tool for targeted and nontargeted GC/MS based metabolome analysis. Analytical chemistry. 2009;81:3429-39.
Holcapek M, Jirasko R, Lisa M. Basic rules for the interpretation of atmospheric pressure ionization mass spectra of small molecules. J Chromatogr A. 2010;1217:3908-21.
Holmes E, Loo RL, Stamler J, Bictash M, Yap IKS, Chan Q, et al. Human metabolic phenotype diversity and its association with diet and blood pressure. Nature. 2008;453:396-U50.
Hoskuldsson A. Variable and subset selection in PLS regression. Chemometr Intell Lab. 2001;55:23-38.
Huang N, Siegel MM, Kruppa GH, Laukien FH. Automation of a Fourier transform ion cyclotron resonance mass spectrometer for acquisition, analysis, and E-mailing of high-resolution exact-mass electrospray ionization mass spectral data. J Am Soc Mass Spectr. 1999;10:1166-73.
Huang ZZ, Chen YJ, Hang W, Gao Y, Lin L, Li DY, et al. Holistic metabonomic profiling of urine affords potential early diagnosis for bladder and kidney cancers. Metabolomics. 2013;9:119-29.
Hubert J, Nuzillard JM, Purson S, Hamzaoui M, Borie N, Reynaud R, et al. Identification of Natural Metabolites in Mixture: A Pattern Recognition Strategy Based on C-13 NMR. Analytical Chemistry. 2014;86:2955-62.
Hufsky F, Rempt M, Rasche F, Pohnert G, Böcker S. De novo analysis of electron impact mass spectra using fragmentation trees. Analytica Chimica Acta. 2012;739:67-76.
Hufsky F, Scheubert K, Bocker S. Computational mass spectrometry for small-molecule fragmentation. Trac-Trend Anal Chem. 2014;53:41-8.
Hummel J, Strehmel N, Selbig J, Walther D, Kopka J. Decision tree supported substructure prediction of metabolites from GC-MS profiles. Metabolomics. 2010;6:322-33.
Jirasek A, Schulze G, Yu M, Blades M, Turner R. Accuracy and precision of manual baseline determination. Applied spectroscopy. 2004;58:1488-99.
Johnson KJ, Wright BW, Jarman KH, Synovec RE. High-speed peak matching algorithm for retention time alignment of gas chromatographic data for chemometric analysis. Journal of Chromatography A. 2003;996:141-55.
Kalivas JH, Roberts N, Sutter JM. Global optimization by simulated annealing with wavelength selection for ultraviolet-visible spectrophotometry. Anal Chem. 1989;61:2024-30.
Kangas LJ, Metz TO, Isaac G, Schrom BT, Ginovska-Pangovska B, Wang L, et al. In silico identification software (ISIS): a machine learning approach to tandem mass spectral identification of lipids. Bioinformatics. 2012;28:1705-13.
Katajamaa M, Miettinen J, Orešič M. MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data. Bioinformatics. 2006;22:634-6.
Katajamaa M, Orešič M. Data processing for mass spectrometry-based metabolomics. Journal of Chromatography A. 2007;1158:318-28.
Kaufmann A. Strategy for the elucidation of elemental compositions of trace analytes based on a mass resolution of 100 000 full width at half maximum. Rapid Commun Mass Sp. 2010;24:2035-45.
Kell DB. Systems biology, metabolic modelling and metabolomics in drug discovery and development. Drug discovery today. 2006;11:1085-92.
Keller BO, Suj J, Young AB, Whittal RM. Interferences and contaminants encountered in modern mass spectrometry. Analytica Chimica Acta. 2008;627:71-81.
Kennard RW, Stone LA. Computer Aided Design of Experiments. Technometrics. 1969;11:137-48.
Kerber A, Laue R, Meringer M, Varmuza K. MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation. In: Gelpi E, editor. Advances in Mass Spectrometry 15: Wiley; 2001. p. 939-40.
Keurentjes JJ, Fu J, De Vos CR, Lommen A, Hall RD, Bino RJ, et al. The genetics of plant metabolism. Nature genetics. 2006;38:842-9.
Kim HK, Choi YH, Verpoorte R. NMR-based plant metabolomics: where do we stand, where do we go? Trends in biotechnology. 2011;29:267-75.
Kim HK, Verpoorte R. Sample preparation for plant metabolomics. Phytochemical analysis. 2010;21:4-13.
Kim S, Zhang X. Discovery of false identification using similarity difference in GC–MS-based metabolomics. J Chemometr. 2014:doi: 10.1002/cem.2665.
Kind T, Fiehn O. Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinformatics. 2007;8:105.
Kind T, Fiehn O. Advances in structure elucidation of small molecules using mass spectrometry. Bioanal Rev. 2010;2:23-60.
Kind T, Wohlgemuth G, Lee DY, Lu Y, Palazoglu M, Shahbaz S, et al. FiehnLib: Mass Spectral and Retention Index Libraries for Metabolomics Based on Quadrupole and Time-of-Flight Gas Chromatography/Mass Spectrometry. Anal Chem. 2009;81:10038-48.
Knolhoff A, Callahan J, Croley T. Mass Accuracy and Isotopic Abundance Measurements for HR-MS Instrumentation: Capabilities for Non-Targeted Analyses. J Am Soc Mass Spectr. 2014;25:1285-94.
Koch BP, Dittmar T, Witt M, Kattner G. Fundamentals of molecular formula assignment to ultrahigh resolution mass data of natural organic matter. Anal Chem. 2007;79:1758-63.
Koekemoer G, Dercksen M, Allison J, Santana L, Reinecke CJ. Concurrent class analysis identifies discriminatory variables from metabolomics data on isovaleric acidemia. Metabolomics. 2012;8:S17-S28.
Kohonen T, Maps S-O. Springer series in information sciences. Self-organizing maps. 1995;30.
Koo I, Kim S, Zhang X. Comparative analysis of mass spectral matching-based compound identification in gas chromatography-mass spectrometry. J Chromatogr A. 2013;1298:132-8.
Kopka J. Current challenges and developments in GC-MS based metabolite profiling technology. Journal of Biotechnology. 2006;124:312-22.
Kopka J, Schauer N, Krueger S, Birkemeyer C, Usadel B, Bergmuller E, et al. GMD@CSB.DB: the Golm Metabolome Database. Bioinformatics. 2005;21:1635-8.
Kriegel H-P, Kröger P, Zimek A. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD). 2009;3:1.
Krishnan S, Vogels JT, Coulier L, Bas RC, Hendriks MW, Hankemeier T, et al. Instrument and process independent binning and baseline correction methods for liquid chromatography–high resolution-mass spectrometry deconvolution. Analytica Chimica Acta. 2012;740:12-9.
Krooshof PWT, Ustun B, Postma GJ, Buydens LMC. Visualization and Recovery of the (Bio)chemical Interesting Variables in Data Analysis with Support Vector Machine Classification. Analytical Chemistry. 2010;82:7000-7.
Krstajic D, Buturovic L, Leahy D, Thomas S. Cross-validation pitfalls when selecting and assessing regression and classification models. J Cheminformatics. 2014;6:10.
Kueger S, Steinhauser D, Willmitzer L, Giavalisco P. High-resolution plant metabolomics: from mass spectral features to metabolites and from whole-cell analysis to subcellular metabolite distributions. The Plant Journal. 2012;70:39-50.
Kuehl D, Wang YD. Peak shape calibration method improves the mass accuracy of mass spectrometers. Biopharm Int. 2006;19:32-+.
Kuhl C, Tautenhahn R, Bottcher C, Larson TR, Neumann S. CAMERA: an integrated strategy for compound spectra extraction and annotation of liquid chromatography/mass spectrometry data sets. Anal Chem. 2012;84:283-9.
Kumari S, Stevens D, Kind T, Denkert C, Fiehn O. Applying In-Silico Retention Index and Mass Spectra Matching for Identification of Unknown Metabolites in Accurate Mass GC-TOF Mass Spectrometry. Anal Chem. 2011;83:5895-902.
Kvalheim OM. Interpretation of partial least squares regression models by means of target projection and selectivity ratio plots. Journal of Chemometrics. 2010;24:496-504.
Kvalheim OM, Brakstad F, Liang Y. Preprocessing of analytical profiles in the presence of homoscedastic or heteroscedastic noise. Analytical chemistry. 1994;66:43-51.
Kvalheim OM, Karstang TV. Interpretation of latent-variable regression models. Chemometr Intell Lab. 1989;7:39-51.
Kvalheim OM, Liang YZ. Heuristic evolving latent projections: resolving two-way multicomponent data. 1. Selectivity, latent-projective graph, datascope, local rank, and unique resolution. Anal Chem. 1992;64:936-46.
Kvalheim OM, Rajalahti T, Arneberg R. X-tended target projection (XTP)—comparison with orthogonal partial least squares (OPLS) and PLS post-processing by similarity transformation (PLS + ST). J Chemometr. 2009;23:49-55.
Leardi R. Application of genetic algorithm–PLS for feature selection in spectral data sets. J Chemometrics. 2000;14:643-55.
Leardi R. Genetic algorithms in chemometrics and chemistry: a review. J Chemometrics. 2001;15:559-69.
Lei Z, Li H, Chang J, Zhao PX, Sumner LW. MET-IDEA version 2.06; improved efficiency and additional functions for mass spectrometry-based metabolomics data processing. Metabolomics. 2012;8:105-10.
Leptos KC, Sarracino DA, Jaffe JD, Krastins B, Church GM. MapQuant: Open-source software for large-scale protein quantification. Proteomics. 2006;6:1770-82.
Li H-D, Liang Y-Z, Xu Q-S, Cao D-S. Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration. Anal Chim Acta. 2009a;648:77-84.
Li H-D, Liang Y-Z, Xu Q-S, Cao D-S. Model population analysis for variable selection. J Chemometr. 2010a;24:418-23.
Li H-D, Liang Y-Z, Xu Q-S, Cao D-S. Recipe for Uncovering Predictive Genes Using Support Vector Machines Based on Model Population Analysis. IEEE ACM T Comput Bi. 2011;8:1633-41.
Li H-D, Xu Q-S, Zhang W, Liang Y-Z. Variable complementary network: a novel approach for identifying biomarkers and their mutual associations. Metabolomics. 2012;8:1218-26.
Li H-D, Zeng M-M, Tan B-B, Liang Y-Z, Xu Q-S, Cao D-S. Recipe for revealing informative metabolites based on model population analysis. Metabolomics. 2010b;6:353-61.
Li HD, Liang YZ, Xu QS. Support vector machines and its applications in chemistry. Chemometr Intell Lab. 2009b;95:188-98.
Li HD, Liang YZ, Xu QS, Cao DS. Model population analysis for variable selection. J Chemometr. 2010c;24:418-23.
Li S, Park Y, Duraisingham S, Strobel FH, Khan N, Soltow QA, et al. Predicting Network Activity from High Throughput Metabolomics. PLoS Comput Biol. 2013a;9:e1003123.
Li X-j, Eugene CY, Kemp CJ, Zhang H, Aebersold R. A software suite for the generation and comparison of peptide arrays from sets of data collected by liquid chromatography-mass spectrometry. Molecular & cellular proteomics. 2005;4:1328-40.
Li Z, Wang JJ, Huang J, Zhang ZM, Lu HM, Zheng YB, et al. Nonlinear alignment of chromatograms by means of moving window fast Fourier transform cross-correlation. Journal of separation science. 2013b;36:1677-84.
Liang J, Yang S, Winstanley A. Invariant optimal feature selection: A distance discriminant and feature ranking based solution. Pattern Recognition. 2008;41:1429-39.
Liang YZ, Kvalheim OM. Resolution of two-way data: theoretical background and practical problem-solving - Part 1: Theoretical background and methodology. Fresen J Anal Chem. 2001;370:694-704.
Liang YZ, Kvalheim OM, Keller HR, Massart DL, Kiechle P, Erni F. Heuristic evolving latent projections: resolving two-way multicomponent data. 2. Detection and resolution of minor constituents. Anal Chem. 1992;64:946-53.
Lin X, Wang Q, Yin P, Tang L, Tan Y, Li H, et al. A method for handling metabonomics data from liquid chromatography/mass spectrometry: combinational use of support vector machine recursive feature elimination, genetic algorithm and random forest for feature selection. Metabolomics. 2011;7:549-58. Lindsay RK, Buchanan BG, Feigenbaum EA, Lederberg J. Dendral - a Case-Study of the 1st Expert-System for Scientific Hypothesis Formation. Artif Intell. 1993;61:209-61.
Lisec J, Schauer N, Kopka J, Willmitzer L, Fernie AR. Gas chromatography mass spectrometry-based metabolite profiling in plants. Nature Protocols. 2006;1:387-96.
Listgarten J, Neal RM, Roweis ST, Wong P, Emili A. Difference detection in LC-MS data for protein biomarker discovery. Bioinformatics. 2007;23:e198-e204.
Little J, Williams A, Pshenichnov A, Tkachenko V. Identification of “Known Unknowns” Utilizing Accurate Mass Data and ChemSpider. J Am Soc Mass Spectr. 2012;23:179-85.
Liu H, Motoda H. Feature selection for knowledge discovery and data mining: Springer; 1998.
Liu R, Lin D, Chang W, Liu C, Tsay W, Li J, et al. Issues to address when isotopically labeled analogues of analytes are used as internal standards. Anal Chem. 2002;74:618A-26A.
Liu X, Zhang Z, Sousa PF, Chen C, Ouyang M, Wei Y, et al. Selective iteratively reweighted quantile regression for baseline correction. Analytical and bioanalytical chemistry. 2014a:1-14.
Liu Y, Hong Z, Tan G, Dong X, Yang G, Zhao L, et al. NMR and LC/MS-based global metabolomics to identify serum biomarkers differentiating hepatocellular carcinoma from liver cirrhosis. International Journal of Cancer. 2014b;135:658-68.
Lommen A. Ultrafast PubChem Searching Combined with Improved Filtering Rules for Elemental Composition Analysis. Anal Chem. 2014;86:5463-9.
Lopatka M, Vivó-Truyols G, Sjerps M. Probabilistic peak detection for first-order chromatographic data. Analytica Chimica Acta. 2014;817:9-16.
Luedemann A, Strassburg K, Erban A, Kopka J. TagFinder for the quantitative analysis of gas chromatography–mass spectrometry (GC-MS)-based metabolite profiling experiments. Bioinformatics. 2008;24:732-7.
Luedemann A, von Malotky L, Erban A, Kopka J. TagFinder: Preprocessing software for the fingerprinting and the profiling of gas chromatography–mass spectrometry based metabolome analyses. Plant Metabolomics: Springer; 2012. p. 255-86.
Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JAK. A tutorial on support vector machine-based methods for classification problems in chemometrics. Anal Chim Acta. 2010;665:129-45.
Maeder M. Evolving factor analysis for the resolution of overlapping chromatographic peaks. Analytical chemistry. 1987;59:527-30.
Mahadevan S, Shah SL, Marrie TJ, Slupsky CM. Analysis of metabolomic data using support vector machines. Analytical Chemistry. 2008;80:7562-70.
Makinen VP, Soininen P, Forsblom C, Parkkonen M, Ingman P, Kaski K, et al. (1)H NMR metabonomics approach to the disease continuum of diabetic complications and premature death. Mol Syst Biol. 2008;4.
Mallows CL. Some comments on Cp. Technometrics. 1973;15:661-75.
Mann HB, Whitney DR. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Ann Math Statist. 1947;18:50-60.
Manne R, Shen HL, Liang YZ. Subwindow factor analysis. Chemometrics and Intelligent Laboratory Systems. 1999;45:171-6.
Mao Q, Bai M, Xu JD, Kong M, Zhu LY, Zhu H, et al. Discrimination of leaves of Panax ginseng and P. quinquefolius by ultra high performance liquid chromatography quadrupole/time-of-flight mass spectrometry based metabolomics approach. Journal of Pharmaceutical and Biomedical Analysis. 2014;97:129-40.
McLafferty FW, Hertel RH, Villwock RD. Computer Identification of Mass-Spectra .6. Probability Based Matching of Mass-Spectra - Rapid Identification of Specific Compounds in Mixtures. Org Mass Spectrom. 1974;9:690-702.
Miller A. Subset selection in regression: CRC Press; 2002.
Mitra P, Murthy C, Pal SK. Unsupervised feature selection using feature similarity. IEEE transactions on pattern analysis and machine intelligence. 2002;24:301-12.
Miura D, Tsuji Y, Takahashi K, Wariishi H, Saito K. A Strategy for the Determination of the Elemental Composition by Fourier Transform Ion Cyclotron Resonance Mass Spectrometry Based on Isotopic Peak Ratios. Anal Chem. 2010;82:5887-91.
Moco S, Bino RJ, Vorst O, Verhoeven HA, de Groot J, van Beek TA, et al. A liquid chromatography-mass spectrometry-based metabolome database for tomato. Plant Physiology. 2006;141:1205-18.
Mylonas R, Mauron Y, Masselot A, Binz PA, Budin N, Fathi M, et al. X-Rank: A Robust Algorithm for Small Molecule Identification Using Tandem Mass Spectrometry. Anal Chem. 2009;81:7604-10.
Nagao T, Yukihira D, Fujimura Y, Saito K, Takahashi K, Miura D, et al. Power of isotopic fine structure for unambiguous determination of metabolite elemental compositions: in silico evaluation and metabolomic application. Anal Chim Acta. 2014;813:70-6.
Narsky I, Porter FC. Methods for Variable Ranking and Selection. Statistical Analysis Techniques in Particle Physics: Wiley-VCH Verlag GmbH & Co. KGaA; 2013. p. 385-415.
Neumann S, Rasche F, Wolf S, Böcker S. Metabolite Identification and Computational Mass Spectrometry. The Handbook of Plant Metabolomics: Wiley-VCH Verlag GmbH & Co. KGaA; 2013. p. 289-303.
Nielsen N-PV, Carstensen JM, Smedsgaard J. Aligning of single and multiple wavelength chromatographic profiles for chemometric data analysis using correlation optimised warping. Journal of Chromatography A. 1998;805:17-35.
North DO. An analysis of the factors which determine signal/noise discrimination in pulsed-carrier systems. Proceedings of the IEEE. 1963;51:1016-27.
Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 1999;27:29-34.
Oksman-Caldentey K-M, Saito K. Integrating genomics and metabolomics for engineering plant metabolic pathways. Current opinion in biotechnology. 2005;16:174-9.
Osorio S, Do PT, Fernie AR. Profiling primary metabolites of tomato fruit with gas chromatography/mass spectrometry. Plant Metabolomics: Springer; 2012. p. 101-9.
Patterson AD, Li H, Eichler GS, Krausz KW, Weinstein JN, Fornace AJ, et al. UPLC-ESI-TOFMS-Based Metabolomics and Gene Expression Dynamics Inspector Self-Organizing Metabolomic Maps as Tools for Understanding the Cellular Response to Ionizing Radiation. Analytical Chemistry. 2008;80:665-74.
Pearson GA. A general baseline-recognition and baseline-flattening algorithm. Journal of Magnetic Resonance (1969). 1977;27:265-72.
Pearson K. On lines and planes of closest fit to systems of points in space. Philosophical Magazine. 1901;2:559-72.
Peironcely JE, Rojas-Cherto M, Fichera D, Reijmers T, Coulier L, Faulon JL, et al. OMG: Open Molecule Generator. J Cheminform. 2012;4:21.
Petyuk VA, Jaitly N, Moore RJ, Ding J, Metz TO, Tang K, et al. Elimination of systematic mass measurement errors in liquid chromatography-mass spectrometry based proteomics using regression models and a Priori partial knowledge of the sample content. Anal Chem. 2008;80:693-706.
Petyuk VA, Mayampurath AM, Monroe ME, Polpitiya AD, Purvine SO, Anderson GA, et al. DtaRefinery, a Software Tool for Elimination of Systematic Errors from Parent Ion Mass Measurements in Tandem Mass Spectra Data Sets. Molecular & Cellular Proteomics. 2010;9:486-96.
Pierce KM, Mohler RE. A Review of chemometrics applied to comprehensive two-dimensional separations from 2008–2010. Separation & Purification Reviews. 2012;41:143-68.
Pierce KM, Wood LF, Wright BW, Synovec RE. A comprehensive two-dimensional retention time alignment algorithm to enhance chemometric analysis of comprehensive two-dimensional separation data. Analytical chemistry. 2005;77:7735-43.
Pluskal T, Castillo S, Villar-Briones A, Orešič M. MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC bioinformatics. 2010;11:395.
Powell LA, Hieftje GM. Computer identification of infrared spectra by correlation-based file searching. Anal Chim Acta. 1978;100:313-27.
Prakash A, Mallick P, Whiteaker J, Zhang H, Paulovich A, Flory M, et al. Signal maps for mass spectrometry-based comparative proteomics. Molecular & cellular proteomics. 2006;5:423-32.
Pravdova V, Walczak B, Massart D. A comparison of two algorithms for warping of analytical signals. Analytica Chimica Acta. 2002;456:77-92.
Prince JT, Marcotte EM. Chromatographic alignment of ESI-LC-MS proteomics data sets by ordered bijective interpolated warping. Analytical chemistry. 2006;78:6140-52.
Radulovic D, Jelveh S, Ryu S, Hamilton TG, Foss E, Mao Y, et al. Informatics platform for global proteomic profiling and biomarker discovery using liquid chromatography-tandem mass spectrometry. Molecular & cellular proteomics. 2004;3:984-97.
Rago D, Mette K, Gurdeniz G, Marini F, Poulsen M, Dragsted LO. A LC-MS metabolomics approach to investigate the effect of raw apple intake in the rat plasma metabolome. Metabolomics. 2013;9:1202-15.
Rajalahti T, Arneberg R, Berven FS, Myhr K-M, Ulvik RJ, Kvalheim OM. Biomarker discovery in mass spectral profiles by means of selectivity ratio plot. Chemometr Intell Lab. 2009a;95:35-48.
Rajalahti T, Arneberg R, Kroksveen AC, Berle M, Myhr K-M, Kvalheim OM. Discriminating Variable Test and Selectivity Ratio Plot: Quantitative Tools for Interpretation and Variable (Biomarker) Selection in Complex Spectral or Chromatographic Profiles. Analytical Chemistry. 2009b;81:2581-90.
Rajalahti T, Arneberg R, Kroksveen AC, Berle M, Myhr KM, Kvalheim OM. Discriminating Variable Test and Selectivity Ratio Plot: Quantitative Tools for Interpretation and Variable (Biomarker) Selection in Complex Spectral or Chromatographic Profiles. Analytical Chemistry. 2009c;81:2581-90.
Rasche F, Svatos A, Maddula RK, Bottcher C, Bocker S. Computing fragmentation trees from tandem mass spectrometry data. Anal Chem. 2011;83:1243-51.
Rasmussen S, Lane GA, Mace W, Parsons AJ, Fraser K, Xue H. The use of genomics and metabolomics methods to quantify fungal endosymbionts and alkaloids in grasses. Plant Metabolomics: Springer; 2012. p. 213-26.
Rauf I, Rasche F, Nicolas F, Böcker S. Finding Maximum Colorful Subtrees in Practice. In: Chor B, editor. Lect N Bioinformat: Springer Berlin Heidelberg; 2012. p. 213-23. Redestig H, Fukushima A, Stenlund H, Moritz T, Arita M, Saito K, et al. Compensation for Systematic Cross-Contribution Improves Normalization of Mass Spectrometry Based Metabolomics Data. Analytical chemistry. 2009;81:7974-80.
Robnik-Šikonja M, Kononenko I. Theoretical and empirical analysis of ReliefF and RReliefF. Machine learning. 2003;53:23-69.
Rogers S, Scheltema RA, Girolami M, Breitling R. Probabilistic assignment of formulas to mass peaks in metabolomics experiments. Bioinformatics. 2009;25:512-8.
Ruckebusch C, Blanchet L. Multivariate curve resolution: A review of advanced and tailored applications and challenges. Analytica Chimica Acta. 2013;765:28-36.
Sadygov RG, Martin Maroto F, Hühmer AF. ChromAlign: a two-step algorithmic procedure for time alignment of three-dimensional LC-MS chromatographic surfaces. Analytical chemistry. 2006;78:8207-17.
Savitski MM, Ivonin IA, Nielsen ML, Zubarev RA, Tsybin YO, Hakansson P. Shifted-basis technique improves accuracy of peak position determination in Fourier transform mass spectrometry. J Am Soc Mass Spectr. 2004;15:457-61.
Schauer N, Steinhauser D, Strelkov S, Schomburg D, Allison G, Moritz T, et al. GC-MS libraries for the rapid identification of metabolites in complex biological samples. Febs Letters. 2005;579:1332-7.
Scheltema RA, Kamleh A, Wildridge D, Ebikeme C, Watson DG, Barrett MR, et al. Increasing the mass accuracy of high-resolution LC-MS data using background ions - a case study on the LTQ-Orbitrap. Proteomics. 2008;8:4647-56.
Scheubert K, Hufsky F, Bocker S. Computational mass spectrometry for small molecules. J Cheminformatics. 2013;5.
Scholkopf B, Muller K-R. Fisher discriminant analysis with kernels. Neural networks for signal processing IX. 1999.
Schwarz G. Estimating the dimension of a model. The annals of statistics. 1978;6:461-4.
Schymanski EL, Gallampois CMJ, Krauss M, Meringer M, Neumann S, Schulze T, et al. Consensus Structure Elucidation Combining GC/EI-MS, Structure Generation, and Calculated Properties. Anal Chem. 2012;84:3287-95.
Schymanski EL, Jeon J, Gulde R, Fenner K, Ruff M, Singer HP, et al. Identifying small molecules via high resolution mass spectrometry: communicating confidence. Environ Sci Technol. 2014;48:2097-8.
Schymanski EL, Meinert C, Meringer M, Brack W. The use of MS classifiers and structure generation to assist in the identification of unknowns in effect-directed analysis. Analytica Chimica Acta. 2008;615:136-47.
Schymanski EL, Meringer M, Brack W. Matching Structures to Mass Spectra Using Fragmentation Patterns: Are the Results As Good As They Look? Anal Chem. 2009;81:3608-17.
Scott IM, Lin W, Liakata M, Wood JE, Vermeer CP, Allaway D, et al. Merits of random forests emerge in evaluation of chemometric classifiers by external validation. Analytica Chimica Acta. 2013;801:22-33.
Shao J. Linear Model Selection by Cross-validation. J Am Stat Assoc. 1993;88:486-94.
Shao X-G, Leung AK-M, Chau F-T. Wavelet: a new trend in chemistry. Accounts of chemical research. 2003;36:276-83.
Shawe-Taylor J, Cristianini N. Kernel methods for pattern analysis: Cambridge university press; 2004.
Smilde AK, van der Werf MJ, Bijlsma S, van der Werff-van-der Vat BJC, Jellema RH. Fusion of mass spectrometry-based metabolomics data. Analytical chemistry. 2005;77:6729-36.
Smith CA, Want EJ, O'Maille G, Abagyan R, Siuzdak G. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Analytical chemistry. 2006;78:779-87.
Snee RD. Validation of regression models: methods and examples. Technometrics. 1977;19:415-28. Sokal R, Rohlf F. Assumptions of analysis of variance. Biometry: The Principles and Practice of Statistics in Biological Research 3rd ed New York: WH Freeman. 1995:396-406.
Solinas A, Chessa M, Culeddu N, Porcu M, Virgilio G, Arcadu F, et al. High resolution-magic angle spinning (HR-MAS) NMR-based metabolomic fingerprinting of early and recurrent hepatocellular carcinoma. Metabolomics. 2014;10:616-26.
Stein S. Mass Spectral Reference Libraries: An Ever-Expanding Resource for Chemical Identification. Anal Chem. 2012;84:7274-82.
Stein SE. Chemical substructure identification by mass spectral library searching. J Am Soc Mass Spectrom. 1995;6:644-55.
Stein SE. An integrated method for spectrum extraction and compound identification from gas chromatography/mass spectrometry data. J Am Soc Mass Spectr. 1999;10:770-81.
Stein SE, Scott DR. Optimization and testing of mass spectral library search algorithms for compound identification. J Am Soc Mass Spectrom. 1994;5:859-66.
Steinbeck C, Han YQ, Kuhn S, Horlacher O, Luttmann E, Willighagen E. The Chemistry Development Kit (CDK): An open-source Java library for chemo- and bioinformatics. J Chem Inf Comp Sci. 2003;43:493-500.
Stone M. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society Series B (Methodological). 1974:111-47.
Sturm M, Bertsch A, Gropl C, Hildebrandt A, Hussong R, Lange E, et al. OpenMS-An open-source software framework for mass spectrometry. BMC Bioinformatics. 2008;9.
Sumner LW, Amberg A, Barrett D, Beale MH, Beger R, Daykin CA, et al. Proposed minimum reporting standards for chemical analysis Chemical Analysis Working Group (CAWG) Metabolomics Standards Initiative (MSI). Metabolomics. 2007;3:211-21.
Sutter JM, Kalivas JH. Comparison of Forward Selection, Backward Elimination, and Generalized Simulated Annealing for Variable Selection. Microchem J. 1993;47:60-6.
Swiniarski RW, Skowron A. Rough set methods in feature selection and recognition. Pattern Recogn Lett. 2003;24:833-49.
Tapp HS, Kemsley EK. Notes on the practical utility of OPLS. Trac-Trend Anal Chem. 2009;28:1322-7.
Tautenhahn R, Böttcher C, Neumann S. Highly sensitive feature detection for high resolution LC/MS. BMC bioinformatics. 2008;9:504.
Tikunov Y, Lommen A, de Vos CR, Verhoeven HA, Bino RJ, Hall RD, et al. A novel approach for nontargeted data analysis for metabolomics. Large-scale profiling of tomato fruit volatiles. Plant Physiology. 2005;139:1125-37.
Tomasi G, van den Berg F, Andersson C. Correlation optimized warping and dynamic time warping as preprocessing methods for chromatographic data. Journal of Chemometrics. 2004;18:231-41.
Toya Y, Shimizu H. Flux analysis and metabolomics for systematic metabolic engineering of microorganisms. Biotechnology Advances. 2013;31:818-26.
Trygg J, Wold S. Orthogonal projections to latent structures (O-PLS). J Chemometr. 2002;16:119-28.
Uarrota VG, Moresco R, Coelho B, Nunes EdC, Peruch LAM, Neubert EdO, et al. Metabolomics combined with chemometric tools (PCA, HCA, PLS-DA and SVM) for screening cassava (Manihot esculenta Crantz) roots during postharvest physiological deterioration. Food Chem. 2014;161:67-78.
Valkenborg D, Mertens I, Lemiere F, Witters E, Burzykowski T. The isotopic distribution conundrum. Mass Spectrometry Reviews. 2012;31:96-109.
van Dam NM, Meijden E. A role for metabolomics in plant ecology. Annual Plant Reviews, Biology of Plant Metabolomics. 2011;43:87.
van den Berg RA, Hoefsloot HC, Westerhuis JA, Smilde AK, van der Werf MJ. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC genomics. 2006;7:142.
van der Greef J, Smilde AK. Symbiosis of chemometrics and metabolomics: past, present, and future. Journal of Chemometrics. 2005;19:376-86.
Vapnik V. Statistical Learning Theory. New York: John Wiley & Sons; 1998.
Varghese RS, Cheema A, Cheema P, Bourbeau M, Tuli L, Zhou B, et al. Analysis of LC-MS Data for Characterizing the Metabolic Changes in Response to Radiation. J Proteome Res. 2010;9:2786-93.
Venable JD, Xu T, Cociorva D, Yates JR, 3rd. Cross-correlation algorithm for calculation of peptide molecular weight from tandem mass spectra. Anal Chem. 2006;78:1921-9.
Verron T, Sabatier R, Joffre R. Some theoretical properties of the O-PLS method. J Chemometr. 2004;18:62-8.
Villas-Bôas SG, Mas S, Åkesson M, Smedsgaard J, Nielsen J. Mass spectrometry in metabolome analysis. Mass Spectrom Rev. 2005;24:613-46.
Vivó-Truyols G, Torres-Lapasió J, Van Nederkassel A, Vander Heyden Y, Massart D. Automatic program for peak detection and deconvolution of multi-overlapped chromatographic signals: Part I: Peak detection. Journal of Chromatography A. 2005;1096:133-45.
Vivó-Truyols G. Bayesian approach for peak detection in two-dimensional chromatography. Analytical chemistry. 2012;84:2622-30.
Wagner C, Sefkow M, Kopka J. Construction and application of a mass spectral and retention time index database generated from plant GC/EI-TOF-MS metabolite profiles. Phytochemistry. 2003;62:887-900.
Walczak B, Massart D. The radial basis functions—partial least squares approach as a flexible non-linear regression technique. Anal Chim Acta. 1996;331:177-85.
Wang JS, Reijmers T, Chen LJ, Van der Heijden R, Wang M, Peng SQ, et al. Systems toxicology study of doxorubicin on rats using ultra performance liquid chromatography coupled with mass spectrometry based metabolomics. Metabolomics. 2009;5:407-18.
Wang Q, Li H-D, Xu Q-S, Liang Y-Z. Noise incorporated subwindow permutation analysis for informative gene selection using support vector machines. Analyst. 2011;136:1456-63.
Wang W, Zhou H, Lin H, Roy S, Shaler TA, Hill LR, et al. Quantification of proteins and metabolites by mass spectrometry without isotopic labeling or spiked standards. Analytical chemistry. 2003;75:4818-26.
Wang Y, Yi L, Liang Y, Li H, Yuan D, Gao H, et al. Comparative analysis of essential oil components in Pericarpium Citri Reticulatae Viride and Pericarpium Citri Reticulatae by GC-MS combined with chemometric resolution method. Journal of Pharmaceutical and Biomedical Analysis. 2008;46:66-74.
Wang YD, Cu M. The Concept of Spectral Accuracy for MS. Anal Chem. 2010;82:7055-62.
Watson DG. A rough guide to metabolite identification using high resolution liquid chromatography mass spectrometry in metabolomic profiling in metazoans. Comput Struct Biotechnol J. 2013;4:e201301005.
Webb AR. Statistical pattern recognition: John Wiley & Sons; 2003. Weber RJM, Southam AD, Sommer U, Viant MR. Characterization of Isotopic Abundance Measurements in High Resolution FT-ICR and Orbitrap Mass Spectra for Improved Confidence of Metabolite Identification. Anal Chem. 2011;83:3737-43.
Weber RJM, Viant MR. MI-Pack: Increased confidence of metabolite identification in mass spectra by integrating accurate masses and metabolic pathways. Chemometr Intell Lab. 2010;104:75-82.
Wei X, Sun W, Shi X, Koo I, Wang B, Zhang J, et al. MetSign: A computational platform for high-resolution mass spectrometry-based metabolomics. Analytical chemistry. 2011;83:7668-75.
Werner E, Heilier JF, Ducruix C, Ezan E, Junot C, Tabet JC. Mass spectrometry for the identification of the discriminating signals from metabolomics: Current status and future trends. J Chromatogr B. 2008;871:143-63.
Westerhuis JA, Hoefsloot HCJ, Smit S, Vis DJ, Smilde AK, van Velzen EJJ, et al. Assessment of PLSDA cross validation. Metabolomics. 2008;4:81-9.
Williams DK, Muddiman DC. Parts-per-billion mass measurement accuracy achieved through the combination of multiple linear regression and automatic gain control in a Fourier transform ion cyclotron resonance mass spectrometer. Anal Chem. 2007;79:5058-63.
Wishart DS. Computational strategies for metabolite identification in metabolomics. Bioanalysis. 2009;1:1579-96.
Wold S. Cross-validatory estimation of the number of components in factor and principal components models. Technometrics. 1978;20:397-405.
Wold S, Antti H, Lindgren F, Öhman J. Orthogonal signal correction of near-infrared spectra. Chemometr Intell Lab. 1998;44:175-85.
Wold S, Johansson E, Cocchi M. PLS: Partial Least Squares Projections to Latent Structures, 3D QSAR in drug design. 1993. p. 523-50.
Wold S, Sjöström M, Eriksson L. PLS-regression: a basic tool of chemometrics. Chemom Intell Lab Syst. 2001a;58:109-30.
Wold S, Sjöström M, Eriksson L. Partial Least Squares Projections to Latent Structures (PLS) in Chemistry. Encyclopedia of Computational Chemistry: John Wiley & Sons, Ltd; 2002.
Wold S, Sjostrom M, Eriksson L. PLS-regression: a basic tool of chemometrics. Chemometr Intell Lab. 2001b;58:109-30.
Wolf S, Schmidt S, Muller-Hannemann M, Neumann S. In silico fragmentation for computer assisted identification of metabolite mass spectra. BMC Bioinformatics. 2010;11:148.
Wolfender J-L, Rudaz S, Hae Choi Y, Kyong Kim H. Plant metabolomics: from holistic data to relevant biomarkers. Current medicinal chemistry. 2013;20:1056-90.
Wong JW, Durante C, Cartwright HM. Application of fast Fourier transform cross-correlation for the alignment of large chromatographic and spectral datasets. Analytical chemistry. 2005;77:5655-61.
Wu AH, Gerona R, Armenian P, French D, Petrie M, Lynch KL. Role of liquid chromatography–high-resolution mass spectrometry (LC-HR/MS) in clinical toxicology. Clinical Toxicology. 2012;50:733-42.
Xu C-J, Jiang J-H, Liang Y-Z. Evolving window orthogonal projections method for two-way data resolution. Analyst. 1999;124:1471-6.
Xu Y, Heilier JF, Madalinski G, Genin E, Ezan E, Tabet JC, et al. Evaluation of Accurate Mass and Relative Isotopic Abundance Measurements in the LTQ-Orbitrap Mass Spectrometer for Further Metabolomics Database Building. Anal Chem. 2010;82:5490-501.
Yang J, Honavar V. Feature Subset Selection Using a Genetic Algorithm. In: Liu H, Motoda H, editors. Feature Extraction, Construction and Selection: Springer US; 1998. p. 117-36.
Yi L-z, Yuan D-l, Liang Y-z, Xie P-s, Zhao Y. Fingerprinting alterations of secondary metabolites of tangerine peels during growth by HPLC-DAD and chemometric methods. Analytica Chimica Acta. 2009;649:43-51.
Yi L, Dong N, Liu S, Yi Z, Zhang Y. Chemical features of Pericarpium Citri Reticulatae and Pericarpium Citri Reticulatae Viride revealed by GC–MS metabolomics analysis. Food Chemistry.
Yi L, Song C, Hu Z, Yang L, Xiao L, Yi B, et al. A metabolic discrimination model for nasopharyngeal carcinoma and its potential role in the therapeutic evaluation of radiotherapy. Metabolomics. 2014;10:697-708.
Yu L, Liu H. Efficient feature selection via analysis of relevance and redundancy. The Journal of Machine Learning Research. 2004;5:1205-24.
Yun Y-H, Cao D-S, Tan M-L, Yan J, Ren D-B, Xu Q-S, et al. A simple idea on applying large regression coefficient to improve the genetic algorithm-PLS for variable selection in multivariate calibration. Chemometr Intell Lab. 2014a;130:76-83.
Yun Y-H, Liang Y-Z, Xie G-X, Li H-D, Cao D-S, Xu Q-S. A perspective demonstration on the importance of variable selection in inverse calibration for complex analytical systems. Analyst. 2013;138:6412-21.
Yun Y-H, Wang W-T, Tan M-L, Liang Y-Z, Li H-D, Cao D-S, et al. A strategy that iteratively retains informative variables for selecting optimal variable subset in multivariate calibration. Anal Chim Acta. 2014b;807:36-43.
Zeng Z-D, Liang Y-Z, Wang Y-L, Li X-R, Liang L-M, Xu Q-S, et al. Alternative moving window factor analysis for comparison analysis between complex chromatographic data. Journal of Chromatography A. 2006;1107:273-85.
Zhang AH, Sun H, Han Y, Yan GL, Yuan Y, Song GC, et al. Ultraperformance Liquid Chromatography-Mass Spectrometry Based Comprehensive Metabolomics Combined with Pattern Recognition and Network Analysis Methods for Characterization of Metabolites and Metabolic Pathways from Biological Data Sets. Analytical Chemistry. 2013a;85:7606-12.
Zhang AH, Sun H, Yan GL, Yuan Y, Han Y, Wang XJ. Metabolomics study of type 2 diabetes using ultra-performance LC-ESI/quadrupole-TOF high-definition MS coupled with pattern recognition methods. Journal of Physiology and Biochemistry. 2014;70:117-28.
Zhang H, Wang H, Dai Z, Chen M-s, Yuan Z. Improving accuracy for cancer classification with a new algorithm for genes selection. BMC Bioinformatics. 2012a;13:1-20.
Zhang LX, Tang CL, Cao DS, Zeng YX, Tan BB, Zeng MM, et al. Strategies for structure elucidation of small molecules using gas chromatography-mass spectrometric data. Trac-Trend Anal Chem. 2013b;47:37-46.
Zhang Z-M, Chen S, Liang Y-Z. Baseline correction using adaptive iteratively reweighted penalized least squares. Analyst. 2010;135:1138-46.
Zhang Z-M, Liang Y-Z, Lu H-M, Tan B-B, Xu X-N, Ferro M. Multiscale peak alignment for chromatographic datasets. Journal of Chromatography A. 2012b;1223:93-106.
Zhao Z, Liu H. Searching for interacting features in subset selection. Intelligent Data Analysis. 2009;13:207-28.
Zheng K, Li Q, Wang J, Geng J, Cao P, Sui T, et al. Stability competitive adaptive reweighted sampling (SCARS) and its applications to multivariate calibration of NIR spectra. Chemometr Intell Lab. 2012;112:48-54.
Zhou B, Wang J, Ressom HW. MetaboSearch: tool for mass-based metabolite identification using multiple databases. PLoS One. 2012;7:e40096.
Zhu ZJ, Schultz AW, Wang JH, Johnson CH, Yannone SM, Patti GJ, et al. Liquid chromatography quadrupole time-of-flight mass spectrometry characterization of metabolites guided by the METLIN database. Nature Protocols. 2013;8:451-60.
Figure legend

Fig.1. A recent literature survey of the number of publications (A) and their citation counts (B), searched in Web of Science (Sep 6th, 2014). "Plant metabolomics" was used as the keyword.

Fig.2. The flowchart of data analysis in plant metabolomics.

Fig.3. Deconvolution results using alternative moving window factor analysis (AMWFA). Fig.3 (A) and (B) are the total ion chromatograms (TICs) of peak clusters I and II of Pericarpium Citri Reticulatae Viride (PCRV) and Pericarpium Citri Reticulatae (PCR) before deconvolution. Fig.3 (C) and (D) are the resolved chromatographic curves of peak clusters I and II, respectively. (Reprinted with permission from (Wang, Yi, 2008)).

Fig.4. GC-MS total ion chromatograms (TICs) of tangerine peels before (A) and after (B) peak alignment. Retention time shifts among samples were removed successfully by the multi-scale peak alignment (MSPA) approach.

Fig.5. The idea and outline of model population analysis (MPA).

Fig.6. The prediction error distributions of an informative, uninformative or interfering variable before (white) and after (gray) permutation, repeated 1000 times. Random sampling is employed here. A. Informative variable: the prediction error increases after permutation. B. Uninformative variable: the prediction error shows no significant difference before and after permutation. C. Interfering variable: the prediction error may decrease after permutation.

Fig.7. The idea and main results of volatile metabolic footprinting of tangerine peels collected from July to December. A. Photographs of tangerine peels. B. Metabolic footprints obtained by PCA. C. A heat map of the relative abundance levels of the volatile compounds during the ripening process. (Reprinted with permission from (Yi, Dong)).

Fig.8. Plots of PLS scores (A) and plots of cross-validated PLS scores (B) on simulated data. Four data sets with random values were simulated by computer. Each data set has 100 samples, and the number of variables is set to 5, 50, 500 and 5000, respectively. The class label of each sample is randomly assigned. For each data set, PLS-DA and cross-validated PLS-DA are implemented.

Table 1. Available databases and libraries for metabolite identification
| Name | Access (a) | Current Size | Website |
|---|---|---|---|
| MS Spectral Library | | | |
| NIST 14 | c | 276,248 (242,466) | http://www.nist.gov/srd/nist1a.cfm |
| Wiley Registry of Mass Spectral Data | c | 670,000 (570,000) | http://onlinelibrary.wiley.com/book/10.1002/9780470175217 |
| Golm Metabolome Database (RT) | d | 26,587 | http://gmd.mpimp-golm.mpg.de/ |
| FiehnLib | d | 1000 | http://fiehnlab.ucdavis.edu/projects/FiehnLib/index_html |
| MassBank | d | 40,889 | http://www.massbank.jp/ |
| NIST MS/MS Library | c | 234,284 (45,298) | http://www.nist.gov/srd/nist1a.cfm |
| ReSpect | d | 9017 | http://spectra.psc.riken.jp/ |
| METLIN | w | 61,784 | http://metlin.scripps.edu |
| Chemical Substance Database | | | |
| PubChem Compound Database | d | >53 million | http://www.ncbi.nlm.nih.gov/pccompound |
| ChemSpider | w | >21 million | http://www.chemspider.com/ |
| Manchester Metabolomics Database | d | 42,553 | http://dbkgroup.org/MMD/ |
| BiGG Database | w | 2835 | http://bigg.ucsd.edu/bigg |
| BioCyc (MetaCyc) | | UNKNOWN | http://biocyc.org/ |
| CAS Registry | c | >89 million | http://www.cas.org/ |
| CSLS | w | UNKNOWN | http://cactus.nci.nih.gov/ |
| GDB databases | d | ~166 billion | http://www.gdb.unibe.ch/gdb/ |
| Dictionary of Natural Products | c | 240,007 | http://dnp.chemnetbase.com/dictionary-search.do?method=view&id=10799945&struct=start&props=&&si= |
| Beilstein database | c | >500 million | http://www.elsevier.com/online-tools/reaxys |
| KEGG ligand database | d | 17,282 | http://www.genome.jp/kegg/ligand.html |
| ChEBI | d | 40,211 | http://www.ebi.ac.uk/chebi/ |
| HMDB | d | 41,806 | http://www.hmdb.ca/ |
| KNApSAcK | d | 50,899 | http://kanaya.naist.jp/KNApSAcK/ |
| LIPID MAPS | d | 37,566 | http://www.lipidmaps.org/ |
| LipidBank | w | 7,009 | http://www.lipidbank.jp/ |
| METLIN | w | 240,501 | http://metlin.scripps.edu |
| SDBS | w | 34,000 | http://sdbs.db.aist.go.jp/sdbs/cgi-bin/cre_index.cgi |

(a) Access right to the database: c, d and w denote commercial, downloadable and online access, respectively.
(RT) Retention indices are included.

Table 2. Available metabolite identification tools and related tools assisting metabolite identification

| Name | Reference | Website | Access |
|---|---|---|---|
| GC-MS Spectrum Identification | | | |
| MassLib | | http://www.masslib.com/ | c |
| MOLGEN-MS | (Kerber, Laue, 2001) | http://molgen.de/?src=documents/molgenms.html | d, w |
| Mass Spectrum Interpreter | (Stein, 1995) | http://chemdata.nist.gov/mass-spc/interpreter/ | d |
| Accurate Mass | | | |
| MetWorks | | http://www.thermoscientific.com | c |
| MetabolitePilot | | http://www.absciex.com | c |
| Seven Golden Rules | (Kind and Fiehn, 2007) | http://fiehnlab.ucdavis.edu/projects/Seven_Golden_Rules/ | d |
| SIRIUS | (Bocker et al., 2009) | http://bio.informatik.uni-jena.de/sirius2/ | d |
| MI-Pack | (Weber and Viant, 2010) | http://www.biosciences-labs.bham.ac.uk/viant/mipack | d |
| MetaboSearch | (Zhou et al., 2012) | http://omics.georgetown.edu/metabosearch.html | d |
| MS/MS Spectrum Prediction | | | |
| Mass Frontier | | http://www.thermoscientific.com | c, g |
| ACD/MS Fragmenter | | http://www.acdlabs.com | c, g |
| MetISIS | (Kangas, Metz, 2012) | http://omics.pnl.gov/software/ | d |
| In silico Fragmentation | | | |
| FiD | (Heinonen, Rantanen, 2008) | http://www.cs.helsinki.fi/group/sysfys/software/fragid/ | d |
| Mass-MetaSite | (Bonn, Leandersson, 2010) | http://www.moldiscovery.com/software/massmetasite | c |
| MetFrag | (Wolf, Schmidt, 2010) | http://c-ruttkies.github.io/MetFrag/ | d, w |
| FingerID | (Heinonen, Shen, 2012) | https://github.com/icdishb/fingerid | d |
| MetFusion | (Gerlich and Neumann, 2013) | http://msbi.ipb-halle.de/MetFusion/ | w |
| CFM-ID | (Allen, Greiner, 2014a, Allen et al., 2014b) | http://cfmid.wishartlab.com/ | d, w |
| De Novo Analysis | | | |
| SIRIUS2 | (Bocker and Rasche, 2008, Rasche, Svatos, 2011) | http://bio.informatik.uni-jena.de/sirius2/ | d |
| Molecule Ion Annotation | | | |
| PUTMEDID-LCMS | (Brown, Dunn, 2009) | http://www.mcisb.org/resources/putmedid.html | d |
| CAMERA | (Kuhl, Tautenhahn, 2012) | http://metlin.scripps.edu/xcms/useful_links.php | d |
| IDEOM | (Creek, Jankevics, 2012) | http://mzmatch.sourceforge.net/ideom.php | d |
| MZedDB | (Draper, Enot, 2009) | http://maltese.dbs.aber.ac.uk:8888/hrmet/index.html | w |
| MAIT | | | |
| Mass Spectra Deconvolution | | | |
| AMDIS | (Stein, 1999) | http://chemdata.nist.gov/dokuwiki/doku.php?id=chemdata:amdis | d |
| Deconvolution Reporting Software | | http://www.chem.agilent.com/en-US/products-services/Software-Informatics/Deconvolution-Reporting-Software-%28DRS%29/Pages/default.aspx | c |
| AnalyzerPro | | http://www.spectralworks.com/analyzerpro.html | c |
| ChromaTOF® | | http://www.leco.com/products/separation-science/software-accessories/chromatof-software | c |
| Formula Generation | | | |
| OMG | (Peironcely, Rojas-Cherto, 2012) | http://sourceforge.net/projects/openmg/ | d |
| The Chemistry Development Kit | (Steinbeck et al., 2003) | http://sourceforge.net/projects/cdk/ | d |
| Formula To Mass To Formula | | http://www.ch.ic.ac.uk/java/applets/f2m2f/ | w |
| Molecular Formula finder | | http://www.chemcalc.org/mf_finder | w |
| HiRes | | http://hires.sourceforge.net/ | w, d |

c: Commercially available. d: Freely downloadable to the local site. w: Freely accessed via web interface. g: Also suitable for GC-MS spectrum.

Table 3. A taxonomy of variable selection techniques with the mentioned methods.

| Methods | Classifier | Interpretability | Interaction effect among variables considered | Variable ranking or subset selection | Computation speed | Reference |
|---|---|---|---|---|---|---|
| PLS-weights | PLS | Based on the loading weight matrices of PLS modeling. | NO | Ranking | High | (Wold, Sjöström, 2002) |
| PLS-VIP | PLS | Accumulates the importance of each variable as reflected by the loading weights of each latent variable of PLS. | NO | Ranking | High | (Favilla, Durante, 2013) |
| PLS-regression coefficient | PLS | A single measure of association between each variable and the response. | NO | Ranking | High | (Wold, Sjöström, 2001a) |
| Correlation | No classifier | Calculated simply between variables and the classification label. | NO | Ranking | High | (Hall, 1999) |
| Information gain | No classifier | Calculated simply between variables and the classification label. | NO | Ranking | High | (Ben-Bassat, 1982) |
| Euclidean distance | No classifier | Calculated simply between variables and the classification label. | NO | Ranking | High | (Liang, Yang, 2008) |
| Mutual information | No classifier | Calculated simply between variables and the classification label. | NO | Ranking | High | (Yu and Liu, 2004) |
| CARS | PLS | Realizes a competitive feature selection based on the absolute regression coefficients. | NO | Subset selection | High | (Li, Liang, 2009a, Zheng, Li, 2012) |
| GA-PLS-DA | PLS-DA | GA is used as the optimization algorithm to find the optimal subset with a PLS-DA classifier. | NO | Subset selection | Low | (Cao, Sitter, 2012) |
| PSO-SVM | SVM | PSO is used as the optimization algorithm to find the optimal subset with an SVM classifier. | NO | Subset selection | Medium | (Alba, Garcia-Nieto, 2007) |
| Random Forest | Decision Tree | Ranks the variables by the percent increase of misclassification error when the variable is permuted randomly. | YES | Ranking | Medium | (Breiman, 2001) |
| SPA | PLS-DA | Identifies and ranks the informative variables based on the difference between the prediction errors of the normal and permuted subwindow for each variable. | YES | Ranking | Medium | (Li, Zeng, 2010b) |
| MIA | SVM | Gives a measure based on the difference between the prediction errors of inclusion and exclusion for each variable with the margin of SVM. | YES | Ranking | Medium | (Li, Liang, 2011) |
| INTERACT | No classifier | Based on inconsistency and symmetrical uncertainty measurements for finding interacting features. | YES | Subset selection | High | (Zhao and Liu, 2009) |
| VCN | PLS-DA | Computes the complementary information between variables and then effectively discovers biomarkers with the help of mutual associations of metabolites. | YES | Ranking | Medium | (Li, Xu, 2012) |
| IRIV | PLS | Finds the optimal subset of variables by observing the difference between the prediction errors of inclusion and exclusion for each variable. | YES | Subset selection | Medium | (Yun, Wang, 2014b) |
| VISSA | PLS | Searches for the optimal variable combinations by shrinking the variable space smoothly. | YES | Subset selection | Medium | (Deng, Yun, 2014) |

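The permutation idea that recurs in Table 3 (Random Forest importance, SPA) and in the legend of Fig.6 can be illustrated with a small sketch: a variable is considered informative when shuffling its values increases the prediction error. The sketch below is not the authors' implementation. It swaps in a simple nearest-centroid classifier instead of a random forest or PLS-DA, uses toy Gaussian data, and all function names (`nearest_centroid_fit`, `permutation_importance`, `make_data`) are hypothetical.

```python
import random

def nearest_centroid_fit(X, y):
    """Return the mean vector (centroid) of each class, labels 0 and 1."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class with the closest centroid (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

def error_rate(centroids, X, y):
    wrong = sum(predict(centroids, x) != lab for x, lab in zip(X, y))
    return wrong / len(y)

def permutation_importance(X_train, y_train, X_test, y_test, rng, n_repeats=20):
    """Mean increase in test error when one variable at a time is shuffled."""
    model = nearest_centroid_fit(X_train, y_train)
    base = error_rate(model, X_test, y_test)
    importances = []
    for j in range(len(X_test[0])):
        increase = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X_test]
            rng.shuffle(column)  # break the link between variable j and the labels
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X_test, column)]
            increase += error_rate(model, X_perm, y_test) - base
        importances.append(increase / n_repeats)
    return base, importances

def make_data(n, rng):
    """Toy data (assumption): variable 0 separates the classes, 1-3 are noise."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(3.0 * label, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(3)])
        y.append(label)
    return X, y

rng = random.Random(0)
X_train, y_train = make_data(100, rng)
X_test, y_test = make_data(100, rng)
base_error, importances = permutation_importance(X_train, y_train, X_test, y_test, rng)
```

In this toy setting the first entry of `importances` should be clearly positive (the informative case of Fig.6A), while the noise variables should stay close to zero (the uninformative case of Fig.6B).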
Table 4. An overview of multivariate analysis methods for modeling

| Method | Category | Advantage | Disadvantage |
|---|---|---|---|
| PCA | unsupervised | Suited to providing an overview of a large dataset. | Class information is not considered. |
| HCA | unsupervised | Suited to providing an overview of the clusters of samples. | Class information is not considered. Variable importance is not obtained. |
| SOM | unsupervised | Accounts for non-linearity in the data. | Class information is not considered. |
| LDA | supervised | Easy and fast. Suited to linear and low-dimensional data. | Not suited to high-dimensional data. |
| PLS-DA | supervised | Particularly suited to linear and co-linear data. | Not suited to unbalanced data. |
| OPLS-DA | supervised | Particularly suited to linear and co-linear data. Good visualization and interpretation ability. | Not suited to unbalanced data. |
| SVM | supervised | Suited to linear and nonlinear problems. High flexibility in modeling non-linear data. | Model tuning is complex. |
| RF | supervised | Suited to linear and nonlinear problems. Resistant to outliers. | Relatively low computation speed. |

Applications in metabolomics: (Mao et al., 2014) (Zhang et al., 2014) (Koekemoer et al., 2012) (Wang et al., 2009) (Draisma et al., 2013) (Hubert et al., 2014) (Kriegel et al., 2009) (Makinen et al., 2008) (Patterson et al., 2008) (Vaclavik et al., 2012) (Yi et al., 2014) (Dong et al., 2012) (Rago et al., 2013) (Varghese et al., 2010) (Solinas et al., 2014) (Zhang et al., 2013a) (Chan et al., 2009) (Holmes et al., 2008) (Krooshof et al., 2010) (Lin et al., 2011) (Mahadevan et al., 2008) (Uarrota et al., 2014) (Bertini et al., 2014) (Fan et al., 2010) (Lin, Wang, 2011) (Liu et al., 2014b)

Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
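
The simulation described in the legend of Fig.8 can be sketched in a few lines: random data with randomly assigned class labels are fitted with PLS-DA, once on all samples and once under cross-validation. The sketch below is only an illustration under stated assumptions. It uses a one-component PLS model on labels coded as +1/-1, balanced random labels, 5-fold cross-validation, and it omits the 5000-variable case to keep the pure-Python run short. None of these choices, nor the function names (`fit_pls1`, `simulate`), come from the original study.

```python
import random

def center(X, means=None):
    """Column-center X; reuse training means when they are supplied."""
    if means is None:
        means = [sum(c) / len(c) for c in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X], means

def fit_pls1(X, y):
    """One-component PLS: weight w proportional to X'y, score t = Xw, y ~ q*t."""
    Xc, x_means = center(X)
    y_mean = sum(y) / len(y)
    yc = [v - y_mean for v in y]
    w = [sum(row[j] * yi for row, yi in zip(Xc, yc)) for j in range(len(Xc[0]))]
    norm = sum(v * v for v in w) ** 0.5 or 1.0
    w = [v / norm for v in w]
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / (sum(v * v for v in t) or 1.0)
    return x_means, y_mean, w, q

def predict(model, X):
    """Predict +1 or -1 from the sign of the fitted response."""
    x_means, y_mean, w, q = model
    Xc, _ = center(X, x_means)
    return [1 if y_mean + q * sum(a * b for a, b in zip(row, w)) > 0 else -1
            for row in Xc]

def accuracy(pred, y):
    return sum(p == t for p, t in zip(pred, y)) / len(y)

def simulate(n_samples, n_vars, rng, n_folds=5):
    """Return (accuracy on the fitted samples, cross-validated accuracy)."""
    X = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_samples)]
    y = [1 if i % 2 == 0 else -1 for i in range(n_samples)]
    rng.shuffle(y)  # class labels carry no information about X
    fit_acc = accuracy(predict(fit_pls1(X, y), X), y)
    cv_pred = [0] * n_samples
    for k in range(n_folds):
        test_idx = list(range(k, n_samples, n_folds))
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        model = fit_pls1([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i, p in zip(test_idx, predict(model, [X[i] for i in test_idx])):
            cv_pred[i] = p
    return fit_acc, accuracy(cv_pred, y)

rng = random.Random(1)
results = {p: simulate(100, p, rng) for p in (5, 50, 500)}
```

As the number of variables grows relative to the number of samples, the fitted model tends to separate the two random classes almost perfectly, whereas the cross-validated accuracy should stay near chance level. This is the contrast between panels (A) and (B) of Fig.8.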