Identification of moisture content in tobacco plant leaves using outlier sample eliminating algorithms and hyperspectral data.

Accepted Manuscript

Identification of moisture content in tobacco plant leaves using outlier sample eliminating algorithms and hyperspectral data

Jun Sun, Xin Zhou, Xiaohong Wu, Xiaodong Zhang, Qinglin Li

PII: S0006-291X(16)30125-5
DOI: 10.1016/j.bbrc.2016.01.125
Reference: YBBRC 35242
To appear in: Biochemical and Biophysical Research Communications
Received Date: 14 January 2016
Accepted Date: 20 January 2016

Please cite this article as: J. Sun, X. Zhou, X. Wu, X. Zhang, Q. Li, Identification of moisture content in tobacco plant leaves using outlier sample eliminating algorithms and hyperspectral data, Biochemical and Biophysical Research Communications (2016), doi: 10.1016/j.bbrc.2016.01.125.

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Identification of Moisture Content in Tobacco Plant Leaves using Outlier Sample Eliminating Algorithms and Hyperspectral Data

Sun Jun1,2, Zhou Xin1,2, Wu Xiaohong2, Zhang Xiaodong3, Li Qinglin3

1. Key Laboratory of Tobacco Biology & Processing, Ministry of Agriculture, Qingdao 266101, China
2. School of Electrical and Information Engineering, Jiangsu University, Zhenjiang 212013, China
3. Key Laboratory of Modern Agricultural Equipment and Technology, Ministry of Education, Jiangsu University, Zhenjiang 212013, China

Abstract

Fast identification of moisture content in tobacco plant leaves plays a key role in the tobacco cultivation industry and benefits the management of tobacco plants on the farm. To identify the moisture content of tobacco plant leaves in a fast and nondestructive way, a method combining Mahalanobis distance with Monte Carlo cross validation (MD-MCCV) was proposed in this study to eliminate outlier samples. Hyperspectral data of 200 tobacco plant leaf samples at 20 moisture gradients were obtained using a FieldSpec® 3 spectrometer. Savitzky-Golay smoothing (SG), roughness penalty smoothing (RPS), kernel smoothing (KS) and median smoothing (MS) were used to preprocess the raw spectra. In addition, Mahalanobis distance (MD), Monte Carlo cross validation (MCCV) and MD-MCCV were applied to identify outlier samples in the raw spectra and in the four sets of smoothed spectra. The successive projections algorithm (SPA) was used to extract the most influential wavelengths. Multiple linear regression (MLR) was applied to build prediction models from the preprocessed spectra at the characteristic wavelengths. The results showed that the four preferred prediction models were MD-MCCV-SG ($R_p^2$ = 0.8401, RMSEP = 0.1355), MD-MCCV-RPS ($R_p^2$ = 0.8030, RMSEP = 0.1274), MD-MCCV-KS ($R_p^2$ = 0.8117, RMSEP = 0.1433) and MD-MCCV-MS ($R_p^2$ = 0.9132, RMSEP = 0.1162). Compared with the MD algorithm, the MCCV algorithm and modeling without sample pretreatment, the MD-MCCV algorithm performed best at eliminating outlier samples from the 20 moisture gradients of tobacco plant leaves, and it can be used to eliminate outlier samples during spectral preprocessing.

Keywords: Moisture content, Tobacco, Hyperspectra, Mahalanobis distance, Monte Carlo cross validation

1 Introduction

Moisture content is an important factor affecting the growth and development of crops. Suitable soil moisture is a necessary requirement for high-quality crop production, and timely and precise detection of leaf and canopy moisture content can reflect the physiological status of the plant. Irrigation should therefore be matched to the actual condition of the plant and to its pattern of moisture requirement. Although the spectral reflectance of the canopy and leaves is sensitive to changes in plant moisture content, it is difficult to obtain accurate information on crop spectral changes, so hyperspectral


technology has been applied to crop moisture detection [1-3]. Crop moisture content and photosynthetic products have been successfully estimated from spectral reflectance [4]. In the near-infrared band [5], the stretching and bending of the O-H bonds in water molecules determine whether electromagnetic radiation is absorbed or reflected at a specific wavelength [6]. In addition, near-infrared hyperspectral technology has been used to identify the moisture content of tea [7]. However, there have been few reports on identifying the moisture content of tobacco plant leaves using hyperspectral technology.

Hyperspectral technology is a physical detection technology; the unknown moisture content of tobacco plant leaves is predicted by establishing a hyperspectral analysis model [8-10], and the accuracy of the model directly affects the prediction accuracy for unknown samples. Such a model relies on a correlation between the hyperspectral data and the chemical values of the tobacco plant leaf. Outlier samples can weaken this correlation and reduce the prediction accuracy of the model, so it is necessary to identify and deal with them. In this study, a method combining Mahalanobis distance with Monte Carlo cross validation (MD-MCCV) was proposed to eliminate outlier samples. Furthermore, hyperspectral data are widely distributed and carry a great deal of information, which makes them difficult to process directly. Therefore, this study also seeks a spectral preprocessing algorithm, a feature extraction algorithm and an optimal prediction model for identifying the moisture content of tobacco plant leaves.
2 Materials and methods

2.1 Hyperspectral data acquisition device

The hyperspectral data acquisition device is composed of a portable spectrometer, an auxiliary light source, a notebook computer, trays and an experiment platform. A FieldSpec® 3 portable spectrum analyzer made by ASD (USA) was used to obtain the hyperspectral data, with a spectral measurement range of 350-2500 nm. In the range of 350-1000 nm, the sampling interval is 1.4 nm and the spectral resolution is 3 nm, while in the range of 1000-2500 nm, the sampling interval is 2 nm and the spectral resolution is 10 nm. The hyperspectral data were exported in ASCII form and stored on the computer for subsequent processing. The spectral data software was ASD ViewSpec Pro. The hyperspectral data acquisition device is shown in Fig. 1. A halogen lamp, which has a wide spectral range and adjustable intensity, was chosen as the auxiliary light source to meet the needs of spectral detection. Spectral information was captured by a fiber-optic probe and transferred to the spectrometer through an optical fiber. The signal was parsed by the spectrometer and transferred to the portable computer, where the hyperspectral data were read by the spectral analysis software and automatically saved as binary files.

2.2 Sample preparation and spectral data acquisition

The tobacco plants used in this study were flue-cured tobacco grown at a tobacco planting base in Weifang, Shandong Province, China. The experiment was carried out from April 2015 to August 2015. First, irrigation amounts for 20 different moisture gradients ranged from 0 to 400 mm, with a step of 20 mm. At the same time, 200 tobacco plants with similar growth were selected and divided into 20 groups. Before the experiment, leaves with good color, no spots and full mesophyll were picked from the same top-leaf position of each plant during the growing period

and then stored in plastic bags at room temperature. Each moisture gradient contained 10 samples, giving 200 tobacco samples in total.

During spectral data acquisition, the tobacco leaf sample was first placed on black velvet, and the spectral probe was placed 5 cm above the table, perpendicular to the tray, which had a diameter of 50 cm. The angle between the auxiliary light and the experimental platform was kept at 45° and the vertical distance between the auxiliary light and the experimental platform was 20 cm. The field of view was set to 25°. Before each tobacco leaf sample was measured, the standard reflecting plate was measured to eliminate system errors caused by environmental factors such as light intensity. Five points on each tobacco leaf were selected by moving the tray. Finally, the measurement at each point was repeated three times and the average value was taken as the final measurement result.

2.3 Moisture measurement

All tobacco plant leaf samples were weighed on a microbalance (Hangzhou Wante Weighing Apparatus Co. Ltd, China) and dried at 80 °C for 3 h in an electric heating air-blowing constant-temperature drying box (Tianjin Macro Promise Instrument Co. Ltd, China). After drying, they were cooled in a desiccator for an hour and then weighed again. The reduction in weight was taken to be the moisture content of the tobacco plant leaf, which has the form [11]:

$$MC = \frac{M_1 - M_2}{M_1} \times 100\% \tag{1}$$

where M1 is the weight of the tobacco plant leaf before drying, M2 is the weight of the tobacco plant leaf after drying, and MC is the moisture content of the tobacco plant leaf.

2.4 Spectral preprocessing methods

Spectral analysis involves two kinds of pretreatment: sample pretreatment and spectral pretreatment. In this paper, four smoothing algorithms were used to preprocess the spectral data: Savitzky-Golay smoothing (SG) [12], roughness penalty smoothing (RPS) [13], kernel smoothing (KS) [14] and median smoothing (MS) [15]. In addition, Mahalanobis distance (MD) [16], Monte Carlo cross validation (MCCV) [17] and a method combining Mahalanobis distance with Monte Carlo cross validation (MD-MCCV) were applied to evaluate outlier samples. Finally, the best pretreatment, that is, the best combination of sample pretreatment and spectral pretreatment, was chosen according to the performance of the MLR regression model.

2.5 Multiple linear regression

The multiple linear regression (MLR) scheme models the relationship between a response variable and a set of predictors with a linear equation combining these predictors [18]. The regression model established by MLR is shown in formula (2):

$$Y = XB + \varepsilon \tag{2}$$

where Y is the dependent variable matrix (dimension N × 1, where N is the number of samples), X is the independent variable matrix (dimension N × M, where M is the number of wavelengths), B is the regression coefficient matrix (dimension M × 1) and ε is the residual matrix (dimension N × 1).

The performance of the model is evaluated with the following indices: the coefficient of determination for calibration $R_c^2$, the root mean square error for calibration RMSEC, the coefficient of determination for validation $R_{cv}^2$, the root mean square error for validation RMSECV, the coefficient of determination for prediction $R_p^2$, and the root mean square error for prediction RMSEP. The smaller the predicted residual sum of squares and the root mean square error, and the larger the coefficient of determination, the better the model performs.

3 Principles and algorithms

3.1 Mahalanobis distance algorithm

The Mahalanobis distance is a measure of the distance between two data points in the space defined by the relevant features [19-21]. Because it accounts for unequal variances as well as correlations between features, it evaluates the distance by assigning different weights, or importance factors, to the features of the data points. In addition, the Mahalanobis distance metric can adjust the geometrical distribution of the data so that the distance between similar data points is small, which can enhance the performance of clustering or classification algorithms. When the size of a training cluster is smaller than its dimension, the covariance matrix becomes singular and cannot be inverted. In this paper, the distance between each sample and the average spectrum of the calibration set is calculated and compared with a threshold value; eliminating the samples that exceed the threshold can improve the accuracy of the model.

3.2 Monte Carlo cross validation algorithm

Monte Carlo cross validation (MCCV), also known as the statistical simulation method, can be used to solve complex statistical models and high-dimensional problems [22-24]. In statistical inference the term cross validation (CV) is commonly used and has a wide meaning; to avoid ambiguity, the general term validation is used here in association with Monte Carlo cross validation (MCCV). The core of the Monte Carlo cross validation algorithm is the efficient extraction of samples for a given objective function distribution.
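As an illustration of the distance-based screening described in Section 3.1, the following Python sketch computes the Mahalanobis distance of each calibration sample to the average spectrum and flags the samples above a threshold. It is a minimal sketch and not the authors' implementation: the function names `mahalanobis_to_mean` and `flag_outliers` are hypothetical, NumPy is assumed, and the pseudo-inverse is an assumption made here so that a singular covariance matrix (the problem noted in Section 3.1) does not stop the calculation.

```python
import numpy as np


def mahalanobis_to_mean(X):
    """Mahalanobis distance of every row of X to the mean row.

    X has one sample per row and one wavelength (feature) per column.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)                      # subtract the average spectrum
    cov = centered.T @ centered / (X.shape[0] - 1)     # sample covariance matrix
    cov_inv = np.linalg.pinv(cov)                      # assumption: pseudo-inverse for singular cases
    # Quadratic form (x_i - mean) C^-1 (x_i - mean)^T for every sample i
    d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return np.sqrt(np.maximum(d2, 0.0))


def flag_outliers(X, threshold=3.0):
    """Boolean mask that is True for samples whose distance exceeds the threshold."""
    return mahalanobis_to_mean(X) > threshold
```

In use, `X` would hold the reflectance values at the selected characteristic wavelengths, and the threshold is a free parameter to be chosen for the data set.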
In this paper, the specific steps of the MCCV algorithm are as follows. Approximately 80% of the samples were selected as the calibration set, the MLR model was established, and a set of prediction residuals was obtained after multiple cycles. Outlier samples were then eliminated according to the mean and variance of the prediction residuals. Next, the MLR model was re-established on the calibration set after eliminating outlier samples, and the remaining 20% was used as the prediction set to validate the model. Finally, the model was evaluated with the indices described in Section 2.5.

3.3 MD-MCCV algorithm

In the present work, MD-MCCV was developed for eliminating outlier samples after spectral preprocessing. It is a fusion of Monte Carlo cross validation and Mahalanobis distance: outlier samples are eliminated by calculating Mahalanobis distance parameters based on the repeated modeling of the Monte Carlo cross validation algorithm. MD-MCCV is a relatively simple algorithm that can be summarized in the following steps:

Step 1: Input the sample set S = {(x1, y1), (x2, y2), ..., (xn, yn)}, where xi ∈ X and X is the spectral values of the sample data. Y is the set of category labels, yi ∈ {1, 2, ..., 20}, i = 1, 2, 3, ..., n.

Step 2: The matrix T of characteristic wavelengths was obtained by applying the SPA algorithm to the sample data.

Step 3: Approximately 80% of the matrix T was selected as the calibration set T1, and the

remaining 20% was used as the prediction set T2.

Step 4: Eliminate outlier samples.

(1) Set m as the number of MCCV cycles, and set the threshold value of the Mahalanobis distance. MLR modeling of the calibration set was carried out in a loop, and the prediction residuals of the calibration set were obtained. The prediction residual is given by formula (3):

$$PR = \hat{y} - y \tag{3}$$

where PR is the prediction residual, $\hat{y}$ is the value predicted by the model, and y is the real value.

(2) Calculate the mean and variance of the prediction residuals, as given by formulas (4) and (5):

$$M = \frac{PR_1 + PR_2 + \cdots + PR_m}{m} \tag{4}$$

$$STD^2 = \frac{(PR_1 - M)^2 + (PR_2 - M)^2 + \cdots + (PR_m - M)^2}{m} \tag{5}$$

where M is the mean of the prediction residuals, $STD^2$ is the variance of the prediction residuals, $PR_i$ is the prediction residual of the i-th cycle (i = 1, 2, 3, ..., m), and m is the number of MCCV cycles.

(3) Calculate the Mahalanobis distance of the mean and variance of the prediction residuals. The Mahalanobis distance is formulated as follows:

$$MD_i^2 = (x_i - \bar{x})\, C^{-1} (x_i - \bar{x})^T \tag{6}$$

$$C = \frac{1}{m - 1} (X_c)^T (X_c) \tag{7}$$

where, in formula (6), $MD_i$ is the Mahalanobis distance of the i-th set of data, $x_i$ is the i-th set of data and $\bar{x}$ is the mean of the data; in formula (7), C is the covariance matrix, $X_c$ is the centered data matrix and m is the number of MCCV cycles.

(4) Outlier samples were eliminated according to the set threshold of the Mahalanobis distance.

Step 5: The MLR model was established on the calibration set after eliminating outlier samples, and the remaining 20% was used as the prediction set to validate the model. Finally, the model was evaluated with the indices described in Section 2.5.

4 Results and discussion

The spectra of all samples (350-2500 nm) are shown in Fig. 2 and the average spectra of the moisture gradients of tobacco plant leaves are shown in Fig. 3. Fig. 3 shows obvious differences among the average spectra of the 20 moisture gradients, and these differences are greater at the peaks of the spectral curves. Therefore, the 20 different moisture gradients of tobacco plant


leaves can be classified according to their hyperspectral data. The average moisture content of the tobacco plant leaves at each gradient is shown in Table 1, which also shows obvious differences among the moisture gradients.

4.1 Spectral pretreatment

External interference, instrument noise and random error in the acquisition of spectral information can severely reduce the accuracy of the model. To eliminate spectral noise and improve the signal-to-noise ratio (SNR), Savitzky-Golay smoothing (SG), roughness penalty smoothing (RPS), kernel smoothing (KS) and median smoothing (MS) were used to process the spectral data; the processed spectral curves are shown in Fig. 4. These methods improve the SNR while keeping the useful information in the spectra.

4.2 Extraction of characteristic wavelengths

The successive projections algorithm (SPA) removes redundant information from the spectra by selecting a set of variables with minimal collinearity. It also greatly reduces the number of variables used in modeling, which improves the speed and efficiency of the model. Therefore, SPA was adopted in this paper for feature extraction from the spectral data. The characteristic wavelengths extracted by SPA from the raw spectral data and from the data processed by the four preprocessing algorithms are shown in Table 2.

4.3 Sample pretreatment

About 80% of the pretreated spectral data were selected as the calibration set and the remaining 20% were used as the prediction set. Outlier samples were eliminated only from the calibration set. Three algorithms were used to evaluate outlier samples after smoothing: the Mahalanobis distance algorithm, the Monte Carlo cross validation algorithm and the MD-MCCV algorithm. The threshold was set to 3 for the MD algorithm.
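The MD-MCCV procedure of Section 3.3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the names `fit_mlr`, `predict_mlr` and `md_mccv_outliers` are hypothetical, NumPy is assumed, the MLR model includes an intercept, residuals are collected for the samples left out in each Monte Carlo cycle (one common reading of MCCV), the covariance matrix is normalized by the number of samples minus one, and a pseudo-inverse is used for numerical safety.

```python
import numpy as np


def fit_mlr(X, y):
    """Least-squares MLR fit with an intercept (assumption: intercept included)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    return coef[0] + X @ coef[1:]


def md_mccv_outliers(X, y, cycles=1500, train_frac=0.8, threshold=2.0, seed=0):
    """Return (distance, flag) per sample from a Monte Carlo residual analysis."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(round(train_frac * n))
    count = np.zeros(n)   # how often each sample was left out
    s1 = np.zeros(n)      # running sum of its residuals
    s2 = np.zeros(n)      # running sum of its squared residuals
    for _ in range(cycles):
        perm = rng.permutation(n)
        train, test = perm[:n_train], perm[n_train:]
        coef = fit_mlr(X[train], y[train])
        pr = predict_mlr(coef, X[test]) - y[test]   # formula (3): PR = y_hat - y
        count[test] += 1
        s1[test] += pr
        s2[test] += pr ** 2
    count = np.maximum(count, 1)                         # guard against division by zero
    mean = s1 / count                                    # formula (4)
    std = np.sqrt(np.maximum(s2 / count - mean ** 2, 0.0))  # square root of formula (5)
    feats = np.column_stack([mean, std])                 # one (mean, std) point per sample
    centered = feats - feats.mean(axis=0)
    cov = centered.T @ centered / (n - 1)                # assumption: normalized by samples
    cov_inv = np.linalg.pinv(cov)
    d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)  # formula (6)
    dist = np.sqrt(np.maximum(d2, 0.0))
    return dist, dist > threshold
```

Samples flagged by the returned mask would be removed from the calibration set before the final MLR model is fitted; the number of cycles and the threshold are parameters of the sketch and should be set to match the experiment.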
For the MCCV algorithm, the number of cycles was 1500. For the MD-MCCV algorithm, the number of cycles was also 1500, and the thresholds of the Mahalanobis distance for the STD and the mean were set to 2. The distribution of the outlier samples is presented in Table 3.

4.4 MLR models

The MLR model was established on the calibration set after eliminating outlier samples, and the prediction set was used to validate the model. The results of the MLR models are presented in Table 4. Table 4 shows that the different preprocessing algorithms affect the performance of the model: some improve it, while others reduce it. In terms of the number of samples, the smoothing preprocessing, MCCV and MD-MCCV algorithms removed relatively few outlier samples. In terms of prediction accuracy, the models built with the MD-MCCV algorithm have a higher $R_p^2$ than those built with the MD algorithm or the MCCV algorithm, and MD-MCCV-MS gives the highest $R_p^2$. Therefore, the MD-MCCV algorithm can effectively improve the accuracy of modeling, and the MLR model based on MD-MCCV-MS preprocessing has the best ability to predict the moisture content of tobacco leaves.

Acknowledgements

This work was partially supported by the Open Project Program of the Key Laboratory of Tobacco Biology & Processing, Ministry of Agriculture, the National Natural Science Funds projects (31471413, 31401286), a Project Funded by the Priority Academic Program Development of


Jiangsu Higher Education Institutions (PAPD), the Six Talent Peaks Project in Jiangsu Province (ZBZZ-019), and the Natural Science Foundation of Jiangsu Province of China (BK20141165).

References

[1] Bandyopadhyay K.K., Pradhan S., Sahoo R.N., Singh Ravender, Gupta V.K., Joshi D.K., Sutradhar A.K. Characterization of water stress and prediction of yield of wheat using spectral indices under varied water and nitrogen management practices. Agricultural Water Management. 146 (2014) 115–123.
[2] Fernandes F.A., Rodrigues S., Law C.L., Mujumdar A.S. Drying of exotic tropical fruits: a comprehensive review. Food Bioprocess Technol. 4 (2011) 163–185.
[3] Henrik K.N., Poul E.L., Na L., Uffe J. Sampling procedure in a willow plantation for estimation of moisture content. Biomass and Bioenergy. 78 (2015) 62–70.
[4] Sims D.A., Gamon J.A. Estimation of vegetation water content and photosynthetic tissue area from spectral reflectance: a comparison of indices based on liquid water and chlorophyll absorption features. Remote Sensing of Environment. 84 (2003) 526–537.
[5] Yin Z., Lei T., Yan Q., Chen Z., Dong Y. A near-infrared reflectance sensor for soil surface moisture measurement. Comput. Electron. Agric. 99 (2013) 101–107.
[6] Lorente D., Aleixos N., Gómez-Sanchis J., Cubero S., García-Navarrete O., Blasco J. Recent advances and applications of hyperspectral imaging for fruit and vegetable quality assessment. Food Bioprocess Technol. 5 (2012) 1121–1142.
[7] Deng Shuiguang, Xu Yifei, Li Xiaoli, He Yong. Moisture content prediction in tealeaf with near infrared hyperspectral imaging. Computers and Electronics in Agriculture. 118 (2015) 38–46.
[8] Gillis N., Plemmons R. Dimensionality reduction, classification, and spectral mixture analysis using non-negative underapproximation. Opt. Eng. 50 (2) (2011) 027001-1–027001-16.
[9] Fauvel M., Tarabalka Y., Benediktsson J.A., Chanussot J., Tilton J.C. Advances in spectral-spatial classification of hyperspectral images. Proc. IEEE 101 (3) (2013) 652–675.
[10] Feng Y.-Z., ElMasry G., Sun D.-W., Scannell A.G.M., Walsh D., Morcy N. Near-infrared hyperspectral imaging and partial least squares regression for rapid and reagentless determination of Enterobacteriaceae on chicken fillets. Food Chemistry. 138 (2–3) (2013) 1829–1836.
[11] Sinija V.R., Mishra H.N. FTNIR spectroscopic method for determination of moisture content in green tea granules. Food and Bioprocess Technology. 4 (1) (2008) 136–141.
[12] Omar Abdel-Aziz, Maha F. Abdel-Ghany, Reham Nagi, Laila Abdel-Fattah. Application of Savitzky–Golay differentiation filters and Fourier functions to simultaneous determination of cefepime and the co-administered drug, levofloxacin, in spiked human plasma. Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy. 139 (2015) 449–455.
[13] Astrid Jullion, Philippe Lambert. Robust specification of the roughness penalty prior distribution in spatially adaptive Bayesian P-splines models. Computational Statistics & Data Analysis. 51 (2007) 2542–2558.
[14] Calfa B.A., Grossmann I.E., Agarwal A., Bury S.J., Wassick J.M. Data-driven individual and joint chance-constrained optimization via kernel smoothing. Computers and Chemical Engineering. 78 (2015) 51–69.
[15] Nathaniel E. Helwig, Yizhao Gao, Shaowen Wang, Ping Ma. Analyzing spatiotemporal trends in social media data via smoothing spline analysis of variance. Spatial Statistics. 14 (2015) 491–504.
[16] Igor Melnykov, Volodymyr Melnykov. On K-means algorithm with the use of Mahalanobis distances. Statistics and Probability Letters. 84 (2014) 88–95.
[17] Khaled Haddad, Ataur Rahman, Mohammad A Zaman, Surendra Shrestha. Applicability of Monte Carlo cross validation technique for model development and validation using generalised least squares regression. Journal of Hydrology. 482 (2013) 119–128.
[18] Velo A., Pérez F.F., Tanhua T., Gilcoto M., Ríos A.F., Key R.M. Total alkalinity estimation using MLR and neural network techniques. Journal of Marine Systems. 111–112 (2013) 11–18.
[19] Lin Haijun, Zhang Huifang, Gao Yaqi, et al. Mahalanobis Distance Based Hyperspectral Characteristic Discrimination of Leaves of Different Desert Tree Species. Spectroscopy and Spectral Analysis. 12 (34) (2014) 3358–3362. (in Chinese with English abstract)
[20] Zhang Yong, Huang Dan, Ji Min, Xie Fuding. Image segmentation using PSO and PCM with Mahalanobis distance. Expert Systems with Applications. 38 (2011) 9036–9040.
[21] Shi Huai-Tao, Liu Jian-Chang, Xue Peng, Zhang Ke, Wu Yu-Hou, Zhang Li-Xiu, Tan Shuai. Improved Relative-transformation Principal Component Analysis Based on Mahalanobis Distance and Its Application for Fault Detection. Acta Automatica Sinica. 9 (39) (2013) 1533–1542.
[22] Zangian M., Minuchehr A., Zolfaghari A. Development and validation of a new multigroup Monte Carlo Criticality Calculations (MC3) code. Progress in Nuclear Energy. 81 (2015) 53–59.
[23] Farah J., Bonfrate A., De Marzi L., De Oliveira A., Delacroix S., Martinetti F., Trompier F., Clairand I. Configuration and validation of an analytical model predicting secondary neutron radiation in proton therapy using Monte Carlo simulations and experimental measurements. Physica Medica. 31 (2015) 248–256.
[24] Zavorka L., Adam J., Artiushenko M., Baldin A.A., et al. Validation of Monte Carlo simulation of neutron production in a spallation experiment. Annals of Nuclear Energy. 80 (2015) 178–187.

Fig.1 Schematic representation of hyperspectral data acquisition device


Fig. 2 Raw spectra of all tobacco samples


Fig. 3 Average spectra of 20 different moisture gradients of tobacco leaves


Fig. 4 Spectral curves of tobacco leaf samples after different preprocessing methods. Panels (a), (b), (c) and (d) show the spectra of all samples preprocessed by SG, RPS, KS and MS, respectively.

Table 1 Different labels of average moisture contents in tobacco leaves

Tobacco Label    MC/%
1                0.4077
2                0.4168
3                0.4578
4                0.4983
5                0.5051
6                0.5152
7                0.5684
8                0.6203
9                0.676
10               0.6925
11               0.6967
12               0.7362
13               0.7601
14               0.7832
15               0.7894
16               0.798
17               0.809
18               0.8239
19               0.838
20               0.8728

Table 2 Characteristic wavelengths extracted by Successive Projections Algorithm

Preprocessing algorithm    NO.    Characteristic wavelengths/nm
Raw data                   7      648, 826, 883, 1002, 1593, 2367, 2369
SG                         6      632, 826, 881, 1601, 1925, 2343
RPS                        7      711, 800, 825, 1000, 1708, 1882, 2381
KS                         6      608, 828, 890, 2009, 2431, 2491
MS                         8      594, 629, 726, 783, 1309, 1604, 2216, 2283
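The wavelength selection summarized in Table 2 relies on SPA. The Python sketch below illustrates only the projection phase of SPA, in which each new variable is the one with the largest norm after the previously selected variable has been projected out; choosing the starting variable and the final number of variables by validation error is omitted. The function name `spa_select` and its parameters are hypothetical, NumPy is assumed, and the number of selected variables is assumed not to exceed the rank of the data.

```python
import numpy as np


def spa_select(X, n_select, start=0):
    """Projection phase of the successive projections algorithm.

    X has one sample per row and one wavelength per column.
    Returns the column indices of the selected wavelengths, in selection order.
    """
    Xp = np.asarray(X, dtype=float)
    Xp = Xp - Xp.mean(axis=0)            # mean-center each wavelength
    selected = [int(start)]
    for _ in range(n_select - 1):
        ref = Xp[:, selected[-1]].copy()
        # Project every column onto the subspace orthogonal to the last selected column
        Xp = Xp - np.outer(ref, ref @ Xp) / (ref @ ref)
        norms = np.linalg.norm(Xp, axis=0)
        norms[selected] = -1.0           # never pick a column twice
        selected.append(int(np.argmax(norms)))
    return selected
```

A column that is nearly a multiple of an already selected one has almost no norm left after the projection, which is how the procedure avoids collinear wavelengths.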

Table 3 Results of the data conducted by MD, MCCV and MD-MCCV

Processing algorithms    NO.1    NO.2    NO.3
MD-SPA-RD                160     157     3
MD-SPA-SG                160     154     6
MD-SPA-RPS               160     151     9
MD-SPA-KS                160     152     8
MD-SPA-MS                160     144     16
MCCV-SPA-RD              160     153     7
MCCV-SPA-SG              160     150     10
MCCV-SPA-RPS             160     149     11
MCCV-SPA-KS              160     145     15
MCCV-SPA-MS              160     149     11
MD-MCCV-SPA-RD           160     146     14
MD-MCCV-SPA-SG           160     137     23
MD-MCCV-SPA-RPS          160     133     27
MD-MCCV-SPA-KS           160     140     20
MD-MCCV-SPA-MS           160     133     27

Note: RD is the raw data. NO.1 is the number of samples in the calibration set. NO.2 is the number of calibration samples after processing. NO.3 is the number of outlier samples.

Table 4 Comparison of the results of MLR for different pre-processing methods

Model    Pre-processing method    R_c^2     R_cv^2    R_p^2     RMSEC     RMSECV    RMSEP
1        RD                       0.2534    0.1163    0.0865    0.1290    0.1427    0.1832
2        SG                       0.4049    0.3865    0.3564    0.1339    0.1402    0.1753
3        RPS                      0.3975    0.3756    0.3487    0.1314    0.1467    0.1743
4        KS                       0.3845    0.3675    0.3521    0.1361    0.1484    0.1721
5        MS                       0.5268    0.5023    0.4965    0.0989    0.1132    0.1361
6        MD-RD                    0.3354    0.3087    0.2856    0.0133    0.0144    0.0242
7        MD-SG                    0.6627    0.6176    0.6023    0.0143    0.0155    0.0254
8        MD-RPS                   0.5686    0.5574    0.5338    0.0277    0.0299    0.0366
9        MD-KS                    0.5701    0.5632    0.4662    0.0219    0.0232    0.0286
10       MD-MS                    0.7958    0.7856    0.7594    0.0051    0.0058    0.0094
11       MCCV-RD                  0.5292    0.5062    0.4897    0.1375    0.1513    0.1424
12       MCCV-SG                  0.7547    0.7376    0.7165    0.1320    0.1416    0.1154
13       MCCV-RPS                 0.7162    0.7084    0.6982    0.1286    0.1384    0.1166
14       MCCV-KS                  0.7511    0.7431    0.7390    0.1348    0.1450    0.1186
15       MCCV-MS                  0.8613    0.8549    0.8323    0.1147    0.1255    0.0935
16       MD-MCCV-RD               0.7331    0.7165    0.6876    0.1351    0.1473    0.1422
17       MD-MCCV-SG               0.8771    0.8543    0.8401    0.1344    0.1445    0.1355
18       MD-MCCV-RPS              0.8734    0.8331    0.8030    0.1314    0.1427    0.1274
19       MD-MCCV-KS               0.8684    0.8261    0.8117    0.1326    0.1444    0.1433
20       MD-MCCV-MS               0.9444    0.9277    0.9132    0.1169    0.1293    0.1162
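The indices reported in Table 4 can be computed as in the following Python sketch. It assumes the standard definitions, which the text does not spell out: the coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares, and the root mean square error is the square root of the mean squared residual. The function names are hypothetical, NumPy is assumed, and the MLR fit includes an intercept column, which is also an assumption.

```python
import numpy as np


def fit_mlr(X, y):
    """Least-squares fit of y = b0 + X b (assumption: intercept included)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    return coef[0] + X @ coef[1:]


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(y_true, y_pred):
    """Root mean square error of the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

Applying `r2` and `rmse` to the calibration, cross-validation and prediction sets in turn gives the three pairs of indices listed in the table.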


(1) A method combining Mahalanobis distance with Monte Carlo cross validation (MD-MCCV) was proposed to eliminate outlier samples in this study.
(2) A tobacco moisture detection model was established in this study.
(3) The results will benefit the field irrigation management of tobacco planting.
(4) A spectral preprocessing algorithm, a feature extraction algorithm and an optimal prediction model for identifying the moisture content of tobacco plant leaves are sought in this study.
