Journal of the Chilean Chemical Society
On-line version ISSN 0717-9707
J. Chil. Chem. Soc. vol.53 no.4 Concepción Dec. 2008
J. Chil. Chem. Soc., 53, N° 4 (2008), pp. 1709-1713
SUPERVISED PATTERN RECOGNITION TECHNIQUES FOR CLASSIFICATION OF EUCALYPTUS SPECIES FROM LEAVES NIR SPECTRA
a Renewable Resources Laboratory, Biotechnology Center, University of Concepcion, Chile
b Faculty of Chemical Sciences, University of Concepcion, Chile
c Faculty of Forestry Sciences, University of Concepcion, Chile
ABSTRACT

Three supervised pattern recognition methods (SPRM) were evaluated to discriminate between the species Eucalyptus globulus and Eucalyptus nitens by applying near infrared (NIR) spectroscopy to leaves. The methods used were k-nearest neighbors (KNN), soft independent modeling of class analogy (SIMCA) and discriminant partial least squares (PLS-DA). First and second derivatives were used as transform techniques, and mean-centering (MC) and autoscaling (AS) as preprocessing techniques. The training set consisted of 288 samples and 20 samples were used as the validation set. No significant difference between the assayed methods was observed; however, the best results for class separation and prediction rate were obtained when the first derivative and MC were used, for all the pattern recognition methods. The use of leaves and NIR spectroscopy avoids the destructive wood analysis usual in the forest industry and facilitates the fast classification of these species for forestry applications.
Key words: Pattern recognition, NIR, Eucalyptus.
INTRODUCTION

The genus Eucalyptus (Myrtaceae) comprises around 600 species, widely distributed in the Southern hemisphere, mainly in Australia and Tasmania 1. Within its 7 subgenera 2, Symphyomyrtus includes the species with the greatest commercial value, among them Eucalyptus nitens and Eucalyptus globulus 3. These species are widely used in the forest industry due to their fast growth, easy adaptability to new soils and environments, and high pulp yield 3,4. The wood of these species has different chemical and mechanical properties and prices. The fast differentiation of these species is not possible without the use of botanical or genomic methodologies on the plant.
Supervised pattern recognition methods (SPRM) 5 have been used for the classification of biological species 6-8 and to determine the geographical and botanical origin of plant derivatives 9. Near infrared (NIR) spectroscopy has been used together with pattern recognition methods to solve plant classification problems 10-12, based on morphological 13,14, chemical 15 and genetic properties 16. NIR has also been used on wood samples for forestry applications, especially in the prediction of chemical composition 7,13,17-20, mechanical properties 18,20 and species identification 6,21.
For Eucalyptus species, Michell et al. 7 developed a method based on principal component analysis (PCA) and soft independent modeling of class analogy (SIMCA) for the classification of E. globulus and E. nitens. Also, Schimleck et al. 22 used NIR spectroscopy with partial least squares (PLS) to predict E. globulus and E. nitens wood chemical composition, and Bailleres et al. 23 used NIR spectroscopy as a tool for rapid screening of some major wood characteristics in a Eucalyptus breeding program. It should be highlighted that in all of those works wood samples were used to obtain the NIR spectra, that PCA does not allow class prediction, and that the SIMCA models did not properly separate the classes. To our knowledge, there are no classification models for the differentiation of Eucalyptus species using NIR reflectance spectroscopy on leaf samples.
The aim of this study was to evaluate the supervised pattern recognition methods (SPRM) k-nearest neighbors (KNN), SIMCA and discriminant partial least squares (PLS-DA) with different preprocessing and transform techniques, in order to obtain a fast and accurate classification of E. globulus and E. nitens using NIR reflectance spectra. The three methods were compared by evaluating the parameters of each method on the training and prediction sets and by using McNemar's statistical test 15, with the objective of finding the best model for the separation of the species and the prediction of the class of unknown samples.
2. Methods description
2.1. Exploratory analysis
Two unsupervised pattern recognition methods, principal component analysis (PCA) and hierarchical cluster analysis (HCA), were used for the exploratory analysis of the data.
PCA 24 provides a way to reduce the dimensionality of the data by finding linear combinations of the independent variables. PCA expresses a matrix X as the product of two other matrices, the score matrix T and the transpose of the loadings matrix P, according to:

X = T Pᵀ
The columns of Pᵀ are the principal components (PCs); the elements of the first column of loadings indicate the contribution of the original variables to the first principal component (PC1), and the score matrix T is the projection of the samples onto the axes defined by the loadings 25. Associated with each factor is a PC which expresses the magnitude of variance captured by that factor.
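As an illustration of the decomposition described above, the scores and loadings can be obtained from a singular value decomposition; this is a minimal NumPy sketch on synthetic data (not the NIPALS algorithm used in the paper):

```python
import numpy as np

def pca(X, n_components):
    """Decompose a data matrix X (after mean-centering) into scores T and
    loadings P so that the centered X ≈ T @ P.T (illustrative sketch)."""
    Xc = X - X.mean(axis=0)                  # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                  # loadings: variable contributions to each PC
    T = Xc @ P                               # scores: projection of samples onto the PCs
    explained = (s ** 2) / np.sum(s ** 2)    # fraction of variance captured per PC
    return T, P, explained[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                 # 20 synthetic samples, 5 variables
T, P, ev = pca(X, 2)
```

The loadings columns are orthonormal, so the scores are recovered exactly by projecting the centered data onto them.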
HCA 26,27 calculates and compares distances between pairs of samples. Relatively small distances between samples imply that the samples are similar; dissimilar samples are separated by relatively large distances. The Euclidean distance d_ab between two sample vectors, a and b, is determined by computing the differences at each of the m variables 5 according to:

d_ab = sqrt( Σ_j (a_j − b_j)² ),  j = 1, …, m
Inter-sample distances are transformed to a standard scale corresponding to a similarity index, calculated (in Pirouette) as one minus the ratio of the pair distance to the largest inter-sample distance:

similarity_ab = 1 − d_ab / d_max
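The distance and similarity computations above can be sketched with SciPy's clustering tools. The `1 − d/d_max` similarity form and the complete-linkage rule here are illustrative assumptions (the paper used incremental linkage in Pirouette), and the two "species" groups are synthetic:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
# Two well-separated synthetic groups in a 4-variable space.
A = rng.normal(0.0, 0.1, size=(10, 4))
B = rng.normal(1.0, 0.1, size=(10, 4))
X = np.vstack([A, B])

d = pdist(X, metric="euclidean")        # pairwise Euclidean distances d_ab
sim = 1.0 - squareform(d) / d.max()     # similarity index on a 0-1 scale (assumed form)
Z = linkage(d, method="complete")       # hierarchical clustering (complete linkage here)
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into two clusters
```

Cutting the dendrogram at two clusters recovers the two groups, mirroring the two-species separation reported for the HCA of the spectra.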
2.2. Supervised Pattern Recognition Methods
SPRM use objects of known class, for which a certain number of variables have been measured, as a training set, and a later, independent set for the evaluation of the performance of the model through a validation step 5,28. In this work we used the KNN, SIMCA and PLS-DA methods.
The KNN method 29 attempts to categorize an unknown sample based on its proximity to samples already placed in categories 30. The predicted class of an unknown sample depends on the class of its k nearest neighbors. The multivariate distance used in KNN is the Euclidean distance d_ab calculated as in HCA, where a_j and b_j are the data vectors of the two samples for variable j and m is the number of variables. The Euclidean distance is computed for each pair of samples in the training set and stored in a table of distances. To predict the class of a sample, the classes of its neighbors are tallied and the sample is assigned to the class to which most of its nearest neighbors belong. The optimal number k of nearest neighbors is determined by a cross-validation procedure 31, where each object in the training set is taken out in turn and considered as a validation sample. This process is performed for k = 1 to n − 1, and the value of k with the lowest error rate is chosen.
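A minimal sketch of KNN with the leave-one-out choice of k described above, in plain NumPy on synthetic two-class data (not the Pirouette implementation):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Assign x to the majority class among its k nearest training samples."""
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Euclidean distances to x
    nearest = np.argsort(d)[:k]
    classes, counts = np.unique(y_train[nearest], return_counts=True)
    return classes[np.argmax(counts)]

def choose_k(X, y, k_max):
    """Leave-one-out cross-validation: pick the k with the lowest error rate."""
    best_k, best_err = 1, np.inf
    for k in range(1, k_max + 1):
        errors = 0
        for i in range(len(X)):
            mask = np.arange(len(X)) != i            # hold sample i out
            if knn_predict(X[mask], y[mask], X[i], k) != y[i]:
                errors += 1
        if errors < best_err:
            best_k, best_err = k, errors
    return best_k

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.2, (15, 3)), rng.normal(2, 0.2, (15, 3))])
y = np.array([0] * 15 + [1] * 15)
k = choose_k(X, y, 5)                                # optimal k by cross-validation
```

For well-separated classes like these, k = 1 already gives zero leave-one-out errors, which matches the optimum of 1 reported in the Results section.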
The SIMCA method 32 develops a PCA for each training set category using the iterative NIPALS algorithm 33. SIMCA determines the number of PCs, or eigenvectors, needed to describe the structure of each training class by cross-validation 5. The PC models have the structure described by:

X_k = 1·x̄_kᵀ + T_k P_kᵀ + E_k

where, for class k, x̄_k is the class mean, T_k and P_k are the class scores and loadings, and E_k is the residual matrix.
The total residual variance of a class is a measure of the compactness of the samples inside the class.
For a new object x, the scores are determined by projecting it onto the class loadings:

t = (x − x̄_k) P_k
Then S² is calculated for each sample, similarly to Eq. (5), and a statistical F-test with a given probability level is applied. If the S value of the new object is smaller than the critical standard deviation (S0) obtained from the residuals of a training class, the new object belongs to class k; otherwise it does not 5. If the new object qualifies as a member of both classes, the class with the smallest sample residual is considered the best, and the other class is deemed next best. If the new object exceeds both critical values, it is assigned to neither class. The values of these residuals are often called distances and are accumulated in a class distance object. The information in this object can be presented in a multiplot view where pairwise combinations of classes form the subplot axes 25. This plot, commonly denominated the Cooman's plot 34, shows four quadrants divided by two lines indicating the S0 values for each training set.
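The per-class modeling step of SIMCA can be sketched as follows. The residual distance and pooled S0 here are simplified (no F-test or degrees-of-freedom correction), and the two training classes are synthetic:

```python
import numpy as np

class SimcaClass:
    """One PCA model per training class; a new sample is judged by its
    residual after projection onto the class model (simplified sketch)."""
    def __init__(self, X, n_pc):
        self.mean = X.mean(axis=0)
        Xc = X - self.mean
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        self.P = Vt[:n_pc].T                      # class loadings
        E = Xc - Xc @ self.P @ self.P.T           # training residuals
        self.s0 = np.sqrt((E ** 2).sum() / E.size)  # pooled residual std (simplified S0)

    def residual_distance(self, x):
        xc = x - self.mean
        e = xc - self.P @ (self.P.T @ xc)         # residual after projection onto the model
        return np.sqrt((e ** 2).mean())

rng = np.random.default_rng(3)
class_a = rng.normal(0, 0.1, (30, 6))
class_b = rng.normal(3, 0.1, (30, 6))
models = {"A": SimcaClass(class_a, 2), "B": SimcaClass(class_b, 2)}

x_new = rng.normal(0, 0.1, 6)                     # drawn from the class-A distribution
assigned = min(models, key=lambda c: models[c].residual_distance(x_new))
```

Comparing a sample's residual distance to each class's S0 is what places it in one of the four quadrants of the Cooman's plot.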
PLS-DA 35 uses regression for classification through partial least squares (PLS), a regression method that finds latent variables in the feature space which have maximum covariance with the predictor variables 36. The input matrix of independent variables is X, and the class assignment is described in the Y matrix, whose columns contain a value of 0 or 1 for each object. The regression coefficient matrix B is calculated from the training set so that:

Y = X B + E
2.3 Pre-processing and transform
MC centers the data about the mean. The mean is computed for each variable according to:

x̄_j = (1/n) Σ_i x_ij,  i = 1, …, n

The mean is then subtracted from each data value to produce a mean-centered matrix defined by Eq. (12), where i is the sample and j is the variable:

x_ij(MC) = x_ij − x̄_j
AS is a preprocessing technique that applies mean-centering followed by variance scaling of the data. Autoscaling of the data is calculated according to the following equation:

x_ij(AS) = (x_ij − x̄_j) / s_j

where s_j is the standard deviation of variable j.
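The two preprocessing techniques reduce to a few lines of NumPy; a minimal sketch:

```python
import numpy as np

def mean_center(X):
    """MC: subtract each variable's mean from every data value."""
    return X - X.mean(axis=0)

def autoscale(X):
    """AS: mean-center, then divide each variable by its standard deviation."""
    return mean_center(X) / X.std(axis=0, ddof=1)

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
Xmc = mean_center(X)    # columns now have zero mean
Xas = autoscale(X)      # columns now have zero mean and unit variance
```

After MC every variable has zero mean; after AS every variable additionally has unit variance, so variables measured on different scales contribute equally.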
The first and second derivatives are based on a Savitzky-Golay polynomial filter 41. This method applies a convolution to the independent variables in a window containing a center data point and n points on either side. A weighted second-order polynomial is fitted to these 2n + 1 points and the center point is replaced by the fitted value.
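SciPy implements this filter directly; a sketch of first- and second-derivative transforms of a synthetic "spectrum" (the window length and polynomial order are illustrative choices, not the settings used in the paper):

```python
import numpy as np
from scipy.signal import savgol_filter

# A smooth synthetic "spectrum" with a little noise.
x = np.linspace(0, 2 * np.pi, 200)
spectrum = np.sin(x) + 0.001 * np.random.default_rng(5).normal(size=200)

# window_length = 2n + 1 points; polyorder = 2 (second-order polynomial fit).
step = x[1] - x[0]
first_deriv = savgol_filter(spectrum, window_length=11, polyorder=2,
                            deriv=1, delta=step)
second_deriv = savgol_filter(spectrum, window_length=11, polyorder=2,
                             deriv=2, delta=step)
```

Because the test signal is sin(x), the filtered first derivative tracks cos(x) and the second derivative tracks −sin(x) away from the window edges, while the polynomial fit suppresses much of the added noise.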
3.1. Samples

Leaves of Eucalyptus species were obtained from forestry industries of the Bío-Bío Region, Chile. Three hundred and eight samples, corresponding to 258 samples of E. globulus (8 maintained under field conditions and 250 under nursery conditions) and 50 samples of E. nitens (all maintained under nursery conditions), were divided by random selection into two sets as follows: 288 samples as the training set for the construction of the classification models and 20 samples as the validation set. The validation set had 10 samples of E. globulus and 10 samples of E. nitens. Leaves were dried at 50 °C for 24 hours, pulverized and sifted through 30 mesh for NIR spectroscopic analysis.
3.2 Acquisition of NIR spectra
A Perkin Elmer NIR reflectance spectrometer (Identicheck FT-NIR, Beaconsfield, England) was used for the collection of NIR reflectance spectra, expressed as percentage of reflectance (%R). For the NIR spectra acquisition, 0.5 g of dried and pulverized Eucalyptus leaves were used; 50 scans over a wavenumber range from 10000 to 4000 cm-1 (1000 - 2500 nm) with a resolution of 8 cm-1 were performed for each sample. Duplicate spectra were measured for each sample and their average was obtained using the Perkin Elmer software. The spectra were transformed to absorbance using log(1/R) for the chemometric analysis (Fig. 1).
3.3 Spectral data pre-treatments
MC and AS were the pre-processing techniques while first and second derivatives were used as transform techniques.
3.4. Chemometric Analysis
The PCA, HCA, SIMCA, KNN and PLS-DA methods were performed using Pirouette 3.11 (Infometrix Inc.). Recognition and prediction ability were evaluated as the percentage of members correctly classified in the training and validation sets (prediction rate), respectively.
a) Exploratory analysis
For PCA, the computation of the PCs was performed using the nonlinear iterative partial least squares (NIPALS) algorithm 42,43 on the training set data. The optimal number of PCs was determined from the cumulative variance percentage of the PCs.
For HCA, the Euclidean distances between pairs of training set samples were calculated and a dendrogram was built using the incremental linkage method to calculate the distances between clusters.
b) Pattern recognition methods
KNN, SIMCA and PLS-DA were applied to the training set, previously transformed with the first or second derivative and preprocessed with the MC or AS techniques, to obtain predictive models. These models were then used to classify the samples of the validation set. The prediction rate (percentage of samples correctly classified in the validation set) was calculated for all the methods.
Cooman's plots 44 were used for the discrimination of classes, and S0 was calculated according to Eq. (6) to evaluate the class compactness in each model.
3.5. Statistical comparison of methods
McNemar's test, a particular case of Fisher's test, is used when the same sample is assayed on two occasions 5. It is based on a χ2 test with one degree of freedom. Details of this test in classification are described in Roggo et al. (2003) 15, where it was used for SPRM comparisons in the qualitative analysis of sugar beet 15. The χ2 critical value with a 5% level of significance and one degree of freedom, χ2(1, 0.95), is 3.84. If the χ2 value is less than 3.84 the null hypothesis is accepted and the two algorithms are not significantly different, but if χ2 is over 3.84 the null hypothesis is rejected and the two algorithms are significantly different, with a probability of 95%.
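A sketch of the statistic in its continuity-corrected form (the correction is an assumption here; n01 and n10 count the validation samples misclassified by only one of the two methods being compared):

```python
def mcnemar_chi2(n01, n10):
    """McNemar's chi-squared statistic (1 degree of freedom) for comparing two
    classifiers tested on the same samples. n01 = samples misclassified only
    by method A, n10 = samples misclassified only by method B.
    Continuity-corrected form (assumed)."""
    if n01 + n10 == 0:
        return 0.0
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# With few discordant samples the statistic stays below the 5% critical
# value of 3.84, so the two methods are not significantly different.
chi2 = mcnemar_chi2(n01=2, n10=1)
```

Comparing the returned value against 3.84 implements the decision rule described above.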
RESULTS AND DISCUSSION
4.1 Exploratory Analysis
PCA of the NIR spectra shows separation of the samples into two non-overlapping groups, corresponding to E. nitens and E. globulus. The scores plot (Fig. 2) shows the best separation when MC and the first derivative were used, with 65% of the cumulative variance explained by the first three PCs. Use of the second derivative gives lower cumulative variance values and, when it is accompanied by AS, the species cannot be properly separated. The classification of the data into species by PCA supports the use of classification methods based on PCs; accordingly, the SIMCA and PLS-DA methods were performed.
When HCA was carried out, all the pre-treatments used separated the samples into two main clusters corresponding to the Eucalyptus species. However, only the first derivative showed a lower similarity index between the species and a greater similarity index among the samples within each species. The two main clusters showed maximal similarity indices between the species of 0.214 and 0.267 for the AS - first derivative and MC - first derivative pre-treatments, respectively. The species classification by HCA supports the construction of a KNN model.
4.2 Classification methods
The optimal number of neighbors in KNN was 1 for all the algorithms used. The results showed that the first and second derivatives were good transformation techniques for the classification of the modeling set. No significant differences were found between the transform and preprocessing techniques on the validation set, since all the methods correctly classified 100% of the samples.
All the algorithms used classified the data of the training set into two separate classes. The Cooman's plot (Fig. 3) shows that the MC - first derivative pre-treatment produces the best separation of the samples into two classes, placing all the samples of each class below their respective S0 values. Both preprocessing techniques accompanied by the second derivative placed all the samples in the class-overlap quadrant, above the S0 values, indicating that the samples could belong to both classes, although no overlap between the E. globulus and E. nitens samples was observed. Models using MC - first derivative also showed the best interclass distance and the greatest cumulative variance explained with the optimal number of PCs, calculated by cross-validation (Table 1); however, models with the second derivative showed lower S² values than models with the first derivative, indicating higher compactness of the classes (Table 1).
It should be highlighted that no significant differences in prediction rate on the validation set were found using McNemar's test with the different transform and preprocessing techniques; thus, MC - first derivative and AS - first derivative correctly classified 95% of the samples, MC - second derivative 100%, and AS - second derivative 90%. SIMCA models previously reported for the classification of Eucalyptus species using wood samples 7 showed a Cooman's plot with improperly separated classes, which shows the advantage of using leaves and applying pre-treatments to the spectra.
PLS-DA on the training set, with the optimal number of PCs determined by the lowest values of the standard error of calibration (SEC), the predicted residual error sum of squares (PRESS) and the percentage of cumulative variance for each class, also showed that MC - first derivative is the best model for the classification of the classes (Table 2), although all the PLS-DA models correctly predicted 100% of the validation set samples.
4.3. Comparison of classification models
There are no significant differences in the prediction rate on the validation set between the algorithms used within the SIMCA, KNN and PLS-DA models, nor between the three models themselves; however, all the methods show the best class separation in the training set when the first derivative and MC were used. Although the prediction rate in the validation set is good for all the methods used, the comparison between the pre-treatments of the data in the training set shows the negative influence of applying the second derivative to our data. This agrees with the known increase of noise when this transform technique is applied. The SIMCA model is able to show the best separation of the classes and the influence of the pre-treatments on our data in the training or prediction set, with a known significance level, and thus this method could be considered the best choice for the subsequent classification of new samples.
It is important to mention that this work shows the use of SPRM not only for the separation of the classes, as previously reported with the use of wood 7,21, but also demonstrates the predictive ability of these methods for a fast, reliable and accurate classification of new samples in the forest industry. The models grouped the samples of E. globulus maintained under field and nursery conditions into the same class. Since there are no significant differences in the prediction rates on the validation set, the results of McNemar's test are not shown.
CONCLUSIONS

All the analyzed methods are able to classify samples of Eucalyptus species. There is no significant difference between the prediction abilities of the methods used; however, the SIMCA method is more adequate for the differentiation of botanical species by NIR spectroscopy, because this method is not deterministic like KNN and PLS-DA: SIMCA gives the probability of a sample belonging to one class or to both classes, and makes it possible to recognize when a sample does not belong to any of the classes studied. The best pre-treatment of the data was the application of mean-centering and the first derivative of the spectra.
NIR spectroscopy of leaves together with SPRM models could avoid the need for botanical classification by genetic or morphological methods and the use of wood samples for classification in forestry applications.
ACKNOWLEDGEMENTS

The authors thank the financial support of the INNOVA Bío-Bío N° 03-B1-210-L1 project and DIUC 206.021.024-1.0. R. Castillo thanks the DAAD for a PhD scholarship. The authors also thank the collaboration of Prof. Dr. Matthias Otto.
REFERENCES

1. G. H. Chippendale, Myrtaceae - Eucalyptus - Angophora, Australian Govt. Publishing Service, Canberra, 1988.
2. L. D. Pryor and L. A. S. Johnson, A classification of the eucalypts, Australian National University Press, Canberra, 1971.
3. J. W. Turnbull, New Forests. 17, 37, (1999).
4. C. A. Raymond, Ann. Forest. Sci. 59, 525, (2002).
5. B. G. M. Vandeginste, D. L. Massart, L. M. C. Buydens, S. De Jong, P. J. Lewi and J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics: Part B, Elsevier, Amsterdam, 1998.
6. N. Gierlinger, M. Schwanninger and R. Wimmer, J. Near Infrared Spec. 12, 113, (2004).
7. A. J. Michell and L. R. Schimleck, Appita J. 51, 127, (1998).
8. Y. A. Woo, H.-J. Kim and J. Cho, Microchem. J. 63, 61, (1999).
9. R. Fernandez-Torres, J. L. Perez-Bernal, M. A. Bello-López, M. Callejon-Mochon, J. C. Jiménez-Sánchez and A. Guiraum-Perez, Talanta. 65, 686, (2005).
10. M. J. Saiz-Abajo, J. M. Gonzalez-Saiz and C. Pizarro, J. Near Infrared Spec. 12, 207, (2004).
11. M. Andre, Anal. Chem. 75, 3460, (2003).
12. Q. S. Chen, J. W. Zhao, H. D. Zhang, M. H. Liu and M. Fang, J. Near Infrared Spec. 13, 327, (2005).
13. P. D. Jones, L. R. Schimleck, G. F. Peter, R. F. Daniels and A. Clark, Wood Sci. Technol. 39, 529, (2005).
14. M. Tigabu and P. C. Oden, New Forests. 25, 163, (2003).
15. Y. Roggo, L. Duponchel and J.-P. Huvenne, Anal. Chim. Acta. 477, 187, (2003).
16. L. Xie, Y. Ying, T. Ying, H. Yu and X. Fu, Anal. Chim. Acta. 584, 379, (2007).
17. A. J. Michell, Appita J. 47, 29, (1994).
18. J. B. Hauksson, G. Bergqvist, U. Bergsten, M. Sjostrom and U. Edlund, Wood Sci. Technol. 35, 475, (2001).
19. A. Terdwongworakul, V. Punsuwan, W. Thanapase and S. Tsuchikawa, J. Wood Sci. 51, 167, (2005).
20. S. S. Kelley, T. G. Rials, L. R. Groom and C. L. So, Holzforschung. 58, 257, (2004).
21. M. Brunner, R. Eugster, E. Trenka and L. Bergamin-Strotz, Holzforschung. 50, 130, (1996).
22. L. R. Schimleck, A. J. Michell and C. A. Raymond, Appita J. 53, 318, (2000).
23. H. Bailleres, F. Davrieus and F. H. Pichavant, Ann. Forest. Sci. 59, 479, (2002).
24. I. T. Jolliffe, Principal Component Analysis, Springer, New York, 2002.
25. Infometrix Inc., Pirouette Multivariate Data Analysis Software User Guide, Bothell, WA, 2003.
26. J. H. Ward, J. Am. Stat. Assoc. 58, 236, (1963).
27. R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, Wiley, New York, 2001.
28. P. K. Hopke, Anal. Chim. Acta. 500, 365, (2003).
29. M. P. Derde, L. Buydens, C. Guns, D. L. Massart and P. K. Hopke, Anal. Chem. 59, 1868, (1987).
30. B. R. Kowalski and C. F. Bender, Anal. Chem. 44, 1405, (1972).
31. B. K. Alsberg, R. Goodacre, J. J. Rowland and D. B. Kell, Anal. Chim. Acta. 348, 389, (1997).
32. S. Wold and M. Sjostrom, in: B. R. Kowalski (Ed.), Chemometrics: Theory and Application, ACS Symposium Series, Washington, DC, 1977, p. 243.
33. M. Otto, Chemometrics. Statistics and computer application in analytical chemistry, Wiley-VCH, Weinheim, 2007.
34. D. Coomans and I. Broeckaert, Potential Pattern Recognition in Chemical and Medical Decision Making, John Wiley, Letchworth, 1986.
35. L. Ståhle and S. Wold, J. Chemometr. 1, 185, (1987).
36. D. L. Massart, B. G. M. Vandeginste, L. M. C. Buydens, S. De Jong, P. J. Lewi and J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics: Part A, Elsevier, Amsterdam, 1997.
37. A. J. Burnham, R. Viveros and J. F. MacGregor, J. Chemometr. 10, 31, (1996).
38. S. Masoum, D. J.-R. Bouveresse, J. Vercauteren, M. Jalali-Heravi and D. N. Rutledge, Anal. Chim. Acta. 558, 144, (2006).
39. Z. Ramadan, D. Jacobs, M. Grigorov and S. Kochhar, Talanta. 68, 1683, (2006).
40. C. Y. Pierce, J. R. Barr, A. R. Woolfitt, H. Moura, E. I. Shaw, H. A. Thompson, R. F. Massung and F. M. Fernandez, Anal. Chim. Acta. 583, 23, (2007).
41. A. Savitzky and M. J. E. Golay, Anal. Chem. 36, 1627, (1964).
42. H. Wold, in: P. R. Krishnaiah (Ed.), Multivariate Analysis, Academic Press, New York, 1966, p. 391.
43. Y. Miyashita, T. Itozawa, H. Katsumi and S.-I. Sasaki, J. Chemometr. 4, (1990).
44. D. Coomans, I. Broeckaert, M. P. Derde, A. Tassin, D. L. Massart and S. Wold, Comput. Biomed. Res. 17, 1, (1984).
(Received: January 24, 2008 - Accepted: July 20, 2008)