
## Journal of the Chilean Chemical Society

*On-line version* ISSN 0717-9707

### J. Chil. Chem. Soc. v.53 n.4 Concepción Dec. 2008

#### http://dx.doi.org/10.4067/S0717-97072008000400016

J. Chil. Chem. Soc., 53, N° 4 (2008), pp. 1709-1713

**SUPERVISED PATTERN RECOGNITION TECHNIQUES FOR CLASSIFICATION OF EUCALYPTUS SPECIES FROM LEAVES NIR SPECTRA**

**ROSARIO CASTILLO^{A}, DAVID CONTRERAS^{A*}, JUANITA FREER^{A,B*}, JOSE RUIZ^{A}, SOFÍA VALENZUELA^{A,C}**

^{a} *Renewable Resources Laboratory, Biotechnology Center, University of Concepcion, Chile; ^{b} Faculty of Chemical Sciences, University of Concepcion, Chile; ^{c} Faculty of Forestry Sciences, University of Concepcion, Chile*

**ABSTRACT**

Three supervised pattern recognition methods (SPRM) were evaluated to discriminate between *Eucalyptus globulus* and *Eucalyptus nitens* species by applying near infrared (NIR) spectroscopy to leaves. The methods used were k-nearest neighbor (KNN), soft independent modeling of class analogy (SIMCA) and discriminant partial least squares (PLS-DA). First and second derivatives were used as transform techniques, and mean-centering (MC) and autoscaling (AS) as preprocessing techniques. The training set was constituted by 288 samples, and 20 samples were used as validation set. No significant difference between the assayed methods was observed; however, the best results for class separation and prediction rate were obtained when the first derivative and MC were used, for all the pattern recognition methods. The use of leaves and NIR spectroscopy avoids the destructive wood analysis usual in forest industries and facilitates the fast classification of these species for forest applications.

**Key words: ***Pattern recognition, NIR, Eucalyptus.*

**INTRODUCTION**

The genus *Eucalyptus* (Myrtaceae) comprises around 600 species, widely distributed in the Southern hemisphere, mainly in Australia and Tasmania^{1}. Within its seven subgenera^{2}, *Symphyomyrtus* includes the species with the highest commercial value, among them *Eucalyptus nitens* and *Eucalyptus globulus*^{3}. These species are widely used in forest industries due to their fast growth, easy adaptability to new soils and environments, and high pulp yield^{3,4}. The wood of these species has different chemical and mechanical properties and prices. Fast differentiation of these species is not possible without the use of botanical or genomic methodologies on the plant.

Supervised pattern recognition methods (SPRM)^{5} have been used for classification of biological species^{6-8} and to determine the geographical and botanical origin of plant derivatives^{9}. Near infrared (NIR) spectroscopy has been used together with pattern recognition methods to solve plant classification problems^{10-12}, based on morphological^{13-14}, chemical^{15} and genetic properties^{16}. NIR has also been used on wood samples for forestry applications, especially in the prediction of chemical composition^{7,13,17-20}, mechanical properties^{18,20} and species identification^{6,21}.

On *Eucalyptus* species, Michell et al.^{7} developed a method based on principal component analysis (PCA) and soft independent modeling of class analogy (SIMCA) for classification of *E. globulus* and *E. nitens*. Also, Schimleck et al.^{22} used NIR spectroscopy with partial least squares (PLS) to predict *E. globulus* and *E. nitens* wood chemical composition, and Baillères et al.^{23} used NIR spectroscopy as a tool for rapid screening of some major wood characteristics in a *Eucalyptus* breeding program. It should be highlighted that in all of those works wood samples were used for the NIR spectra, that PCA does not allow prediction of classes, and that the SIMCA models did not properly separate the classes. To our knowledge, there are no classification models developed for differentiation of *Eucalyptus* species using NIR reflectance spectroscopy with leaves as samples.

The aim of this study was to evaluate the supervised pattern recognition methods (SPRM) k-nearest neighbor (KNN), SIMCA and discriminant partial least squares (PLS-DA), with different preprocessing and transform techniques, in order to obtain a fast and accurate classification of *E. globulus* and *E. nitens* using NIR reflectance spectra. The three methods were compared by evaluating the parameters of each method on the training and prediction sets and by using the statistical McNemar's test^{15}, with the objective of finding the best model for separation of the species and for prediction of the class of unknown samples.

**2. Methods description**

*2.1. Exploratory analysis*

Two unsupervised pattern recognition methods, principal component analysis (PCA) and hierarchical cluster analysis (HCA), were used for the exploratory analysis of the data.

PCA^{24} provides a way to reduce the dimensionality of the data by finding linear combinations of the independent variables. PCA expresses a matrix *X* as a product of two other matrices, the score matrix *T* and the transpose of the loadings matrix *P*, according to:

X = TP^{T} + E

where *E* collects the residuals when fewer PC's are retained. The columns of *P^{T}* are the principal components (PC's); the elements of the first column of loadings indicate the contribution of the original variables to the first principal component (PC 1), and the score matrix *T* is the projection of the samples onto the axes defined by the loadings^{25}. Associated with each factor is a PC which expresses the magnitude of variance captured by that factor.
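As an illustrative sketch (not part of the original analysis), the decomposition above can be computed with NumPy's SVD; the small data matrix here is a hypothetical stand-in for the NIR spectra:

```python
import numpy as np

# Hypothetical 6-sample x 4-variable data matrix (stand-in for NIR spectra)
X = np.array([[2.0, 1.0, 0.5, 0.1],
              [1.9, 1.1, 0.4, 0.2],
              [0.3, 0.2, 1.8, 1.6],
              [0.2, 0.3, 1.9, 1.5],
              [2.1, 0.9, 0.6, 0.1],
              [0.4, 0.1, 1.7, 1.7]])

Xc = X - X.mean(axis=0)             # mean-center (MC)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * s                           # score matrix T
P = Vt.T                            # loadings matrix P (columns = PC's)
explained = s**2 / np.sum(s**2)     # fraction of variance captured per PC

# With all components kept, T @ P.T reproduces the centered data exactly
assert np.allclose(Xc, T @ P.T)
```

Truncating *T* and *P* to the first few columns gives the reduced-dimension model, with the discarded part playing the role of the residual *E*.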

HCA^{26,27} calculates and compares distances between pairs of samples. Relatively small distances between samples imply that the samples are similar; dissimilar samples are separated by relatively large distances. The Euclidean distance *d_{ab}* between two sample vectors, *a* and *b*, is determined by computing differences at each of the *m* variables^{5}, according to:

d_{ab} = [Σ_{j=1}^{m} (a_{j} − b_{j})^{2}]^{1/2}

Inter-sample distances are transformed to a standard scale corresponding to a similarity index, calculated by:

similarity = 1 − d_{ab}/d_{max}

where *d_{max}* is the largest inter-sample distance in the data set.
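A minimal numeric sketch of these two formulas (toy vectors, not the leaf spectra) could look like:

```python
import numpy as np

# Toy sample vectors: two similar pairs (hypothetical values)
samples = np.array([[1.0, 0.2, 0.1],
                    [1.1, 0.3, 0.1],
                    [0.1, 1.2, 0.9],
                    [0.2, 1.1, 1.0]])

# Pairwise Euclidean distances d_ab over the m variables
d = np.sqrt(((samples[:, None, :] - samples[None, :, :]) ** 2).sum(axis=2))

# Standard-scale similarity index: 1 - d_ab / d_max
similarity = 1.0 - d / d.max()

# Samples 0 and 1 are close, so their similarity exceeds that of 0 and 2
assert similarity[0, 1] > similarity[0, 2]
```

A dendrogram is then built by repeatedly merging the pair of clusters with the smallest inter-cluster distance.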

*2.2. Supervised Pattern Recognition Methods*

*SPRM* use objects of known class, for which a certain number of variables have been measured, as a training set, and a later, independent set for evaluation of the performance of the model through a validation step^{5,28}. In this work we used the KNN, SIMCA and PLS-DA methods.

The KNN method^{29} attempts to categorize an unknown based on its proximity to samples already placed in categories^{30}. The predicted class of an unknown sample depends on the class of its *k* nearest neighbors. The multivariate distance used in KNN is the Euclidean distance *d_{ab}* calculated as in HCA, where *a_{j}* and *b_{j}* are the data values of the two samples for variable *j* and *m* is the number of variables. The Euclidean distance is calculated for each pair of samples in the training set and stored in a table of distances. For class prediction of any sample, the classes of its neighbors are tallied and the sample is assigned to the class to which most of its nearest neighbors belong. The optimal number *k* of nearest neighbors is determined by a cross-validation procedure^{31}, where each object in the training set is taken out and considered as a validation sample. This process is performed for *k* = 1 to *n*-1, and the number *k* of nearest neighbors with the lowest error rate is chosen.
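The procedure above (majority vote plus leave-one-out selection of *k*) can be sketched as follows; the data, class labels and function names are hypothetical, chosen only to mirror the two-species setting:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Assign x the majority class among its k nearest training samples."""
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Euclidean distances
    nearest = y_train[np.argsort(d)[:k]]
    return np.bincount(nearest).argmax()

def best_k_loo(X, y, k_max):
    """Choose k by leave-one-out cross validation (lowest error rate)."""
    errors = []
    for k in range(1, k_max + 1):
        wrong = 0
        for i in range(len(X)):
            mask = np.arange(len(X)) != i           # hold sample i out
            wrong += int(knn_predict(X[mask], y[mask], X[i], k) != y[i])
        errors.append(wrong)
    return int(np.argmin(errors)) + 1

# Hypothetical two-class data (0 = "globulus-like", 1 = "nitens-like")
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
              [1.0, 1.1], [1.1, 1.0], [0.9, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

k = best_k_loo(X, y, k_max=3)
pred = knn_predict(X, y, np.array([0.05, 0.05]), k)
```

On these well-separated toy clusters the optimal *k* is 1, matching the value found for the real leaf spectra in the Results section.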

The SIMCA method^{32} develops a PCA for each training set category using the iterative NIPALS algorithm^{33}. SIMCA determines the number of PC's or eigenvectors needed to describe the structure of each training class by cross validation^{5}. The PC models have the structure described by:

X_{k} = T_{k}P_{k}^{T} + E_{k}

for each class *k*. The total residual variance of a class, computed from the elements of *E_{k}*, is a measure of the compaction of the samples inside the class.

For a new object *x*, the scores are determined by projecting the object onto the class loadings. Then *S^{2}* is calculated for the sample similarly to Eq. (5), and a statistical *F*-test with a determined level of probability is applied. If *S* of the new object is less than the critical standard deviation (*S_{0}*) obtained from the residuals of a training class *K*, the new object belongs to class *K*; otherwise it does not^{5}. If the new object qualifies as a member of both classes, the class having the smallest sample residual is considered the best, and the other class is deemed next best. If the new object exceeds both critical values, it is assigned to neither class. The values of these residuals are often called distances and are accumulated in a class-distance object. The information in this object can be presented in a multiplot view where pairwise combinations of classes form the subplot axes^{25}. This plot, commonly called the Coomans plot^{34}, shows four quadrants divided by two lines indicating the *S* values for each training set.
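The class-modeling idea can be sketched as below. This is a simplified illustration, not the Pirouette implementation: the *F*-test against the critical value *S_{0}* is reduced to comparing residual distances, and all data and names are hypothetical.

```python
import numpy as np

def fit_class_model(Xk, n_pc):
    """PCA model of one training class: mean, loadings, residual std (S0-like)."""
    mean = Xk.mean(axis=0)
    Xc = Xk - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_pc].T                          # retained loadings
    E = Xc - Xc @ P @ P.T                    # residuals after n_pc PCs
    s0 = np.sqrt((E ** 2).sum() / E.size)    # simplified residual std of the class
    return mean, P, s0

def class_distance(model, x):
    """Residual standard deviation of a new object under a class model."""
    mean, P, _ = model
    xc = x - mean
    e = xc - xc @ P @ P.T                    # part not explained by the class PCs
    return np.sqrt((e ** 2).mean())

# Hypothetical training classes (rows = samples)
class_a = np.array([[1.0, 0.1, 0.0], [1.1, 0.0, 0.1],
                    [0.9, 0.1, 0.1], [1.0, 0.0, 0.0]])
class_b = np.array([[0.0, 1.0, 1.1], [0.1, 1.1, 1.0],
                    [0.0, 0.9, 1.0], [0.1, 1.0, 0.9]])
models = {"A": fit_class_model(class_a, 1), "B": fit_class_model(class_b, 1)}

x_new = np.array([1.05, 0.05, 0.05])
dists = {name: class_distance(m, x_new) for name, m in models.items()}
assigned = min(dists, key=dists.get)         # smallest residual wins
```

Plotting each sample's distance to class A against its distance to class B reproduces the Coomans plot described above.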

PLS-DA^{35} performs classification through regression by partial least squares (PLS), a regression method that finds latent variables in the feature space having maximum covariance with the response variables^{36}. The input matrix of independent variables is *X*, and the assignment to classes is described in the *Y* matrix, which has columns with a value of 0 or 1 for each object. The regression coefficient matrix **B** is calculated from the training set according to:

**B** = W(P^{T}W)^{-1}Q^{T}

where *W*, *P* and *Q* are the weight, *X*-loading and *Y*-loading matrices of the PLS decomposition.
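A compact NIPALS-style PLS1 sketch of this regression-based classifier is shown below; the two-class toy data and the 0/1 coding with a 0.5 decision threshold are illustrative assumptions, not the paper's data:

```python
import numpy as np

def pls_fit(X, Y, n_lv):
    """Minimal PLS1 (NIPALS-style) returning regression coefficients B."""
    X = X - X.mean(axis=0)
    y = Y - Y.mean()
    W, P, Q = [], [], []
    Xr, yr = X.copy(), y.copy()
    for _ in range(n_lv):
        w = Xr.T @ yr                        # weights: direction of max covariance
        w = w / np.linalg.norm(w)
        t = Xr @ w                           # scores
        p = Xr.T @ t / (t @ t)               # X loadings
        q = (yr @ t) / (t @ t)               # y loading
        Xr = Xr - np.outer(t, p)             # deflate X
        yr = yr - q * t                      # deflate y
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    return W @ np.linalg.inv(P.T @ W) @ Q    # B = W (P^T W)^-1 Q

# Hypothetical two-class training data; Y holds 0/1 class membership
X = np.array([[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.1, 0.0, 0.1],
              [0.1, 1.0, 0.9], [0.0, 1.1, 1.0], [0.2, 0.9, 1.1]])
Y = np.array([0., 0., 0., 1., 1., 1.])

B = pls_fit(X, Y, n_lv=2)
# Predict a new sample: center, regress, threshold the 0/1 response at 0.5
x_new = np.array([0.05, 1.0, 1.0])
y_hat = (x_new - X.mean(axis=0)) @ B + Y.mean()
predicted_class = int(y_hat > 0.5)
```

With several classes, *Y* gains one 0/1 column per class and the sample is assigned to the column with the largest predicted response.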

*2.3 Pre-processing and transform*

*MC* centers the data about the mean. The mean is computed for each variable according to:

x̄_{j} = (1/n) Σ_{i=1}^{n} x_{ij}

This mean is then subtracted from each data value to produce a mean-centered matrix, defined by Eq. (12), where *i* is the sample and *j* is the variable:

x_{ij(MC)} = x_{ij} − x̄_{j}

*AS* is a preprocessing technique that applies mean-centering followed by variance scaling of the data. Autoscaling of the data is calculated according to the following equation:

x_{ij(AS)} = (x_{ij} − x̄_{j}) / s_{j}

where *s_{j}* is the standard deviation of variable *j*.
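The two preprocessing steps translate directly into array operations (toy matrix with hypothetical values):

```python
import numpy as np

# Hypothetical raw data: rows are samples, columns are variables
X = np.array([[2.0, 10.0],
              [4.0, 30.0],
              [6.0, 20.0]])

x_mean = X.mean(axis=0)              # per-variable mean
X_mc = X - x_mean                    # mean-centering (MC)

s = X.std(axis=0, ddof=1)            # per-variable standard deviation
X_as = X_mc / s                      # autoscaling (AS)

assert np.allclose(X_mc.mean(axis=0), 0.0)        # centered columns average to 0
assert np.allclose(X_as.std(axis=0, ddof=1), 1.0)  # autoscaled columns have unit variance
```

AS puts all variables on a common scale, which matters when variables differ greatly in magnitude; MC alone preserves the relative variance of each variable.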

First and second derivatives are based on a Savitzky-Golay polynomial filter^{41}. This method applies a convolution to the independent variables in a window containing a center data point and *n* points on either side. A weighted second-order polynomial is fit to these 2*n* + 1 points and the center point is replaced by the fitted value.
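The derivative transform can be reproduced with SciPy's Savitzky-Golay filter; the quadratic test "spectrum" here is a hypothetical stand-in whose exact derivative is known:

```python
import numpy as np
from scipy.signal import savgol_filter

# Hypothetical smooth "spectrum": y = x^2, whose first derivative is 2x
x = np.linspace(0.0, 1.0, 101)
spectrum = x**2

# 11-point window (n = 5 on each side), second-order polynomial, 1st derivative
d1 = savgol_filter(spectrum, window_length=11, polyorder=2,
                   deriv=1, delta=x[1] - x[0])

# Away from the edges the filtered derivative tracks 2x closely
assert np.allclose(d1[20:80], 2 * x[20:80], atol=1e-6)
```

For a second-order polynomial fit, the second derivative is obtained the same way with `deriv=2`.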

**EXPERIMENTAL**

*3.1 Samples*

Leaves of *Eucalyptus* species were obtained from forestry industries of the Bío-Bío Region, Chile. Three hundred and eight samples, corresponding to 258 samples of *E. globulus* (8 maintained under field conditions and 258 under nursery conditions) and 50 samples of *E. nitens* (all maintained under nursery conditions), were divided by random selection into two sets as follows: 288 samples as training set for construction of the classification models and 20 samples as validation set. The validation set had 10 samples of *E. globulus* and 10 samples of *E. nitens*. Leaves were dried at 50 °C for 24 hours, pulverized and sifted through 30 mesh for NIR spectroscopic analysis.

*3.2 Acquisition of NIR spectra*

A Perkin Elmer NIR reflectance spectrometer (Identicheck FT-NIR, Beaconsfield, England) was used for the collection of NIR reflectance spectra, expressed as percentage of reflectance (*%R*). For the NIR spectra acquisition, 0.5 g of *Eucalyptus* leaves were dried and pulverized; then 50 scans over a wavenumber range from 10000 to 4000 cm^{-1} (1000 - 2500 nm) with a resolution of 8 cm^{-1} were performed for each sample. Duplicate spectra were measured for each sample and their average was obtained using the Perkin Elmer software. The spectra were transformed to absorbance using log(1/*R*) for chemometric analysis (Fig. 1).

*3.3 Spectral data pre-treatments*

*MC *and *AS *were the pre-processing techniques while first and second derivatives were used as transform techniques.

*3.4. Chemometric Analysis*

PCA, HCA, SIMCA, KNN and PLS-DA were performed using Pirouette 3.11 (Infometrix Inc.). Recognition and prediction ability were evaluated by the percentage of members correctly classified in the training and validation sets (prediction rate), respectively.

*a) Exploratory analysis*

For PCA, the computation of the PC's was performed using the nonlinear iterative partial least squares (NIPALS) algorithm^{42,43} on the training set data. The optimal number of PC's was determined based on the cumulative variance percentage of the PC's.

For the HCA, the Euclidean distances between pairs of training set samples were calculated and a dendrogram was built, using the incremental linkage method^{25} to calculate the distance between clusters.

*b) Pattern recognition methods*

KNN, SIMCA and PLS-DA were applied to the training set, previously transformed with first or second derivative and preprocessed with the *MC* or *AS* techniques, to obtain predictive models. These models were then used to classify the samples of the validation set. The prediction rate (percentage of samples correctly classified in the validation set) was calculated for all the methods.

Cooman's plots^{44} were used for the discrimination of classes, and *S_{0}* was calculated according to Eq. (6) to evaluate the class compaction in each model.

*3.5. Statistical comparison of methods*

McNemar's test, a particular case of Fisher's test, is used when the same sample is assayed on two occasions^{5}. This test is based on a χ^{2} test with one degree of freedom. Details of this test applied to classification are described in Roggo et al. (2003)^{15}. The test has been used for SPRM comparisons in qualitative analysis of sugar beet^{15}. The χ^{2} critical value with a 5% level of significance and one degree of freedom, χ^{2}_{(1,0.95)}, is 3.84. If the χ^{2} value is less than 3.84, the null hypothesis is true and the two algorithms are not significantly different; if χ^{2} is over 3.84, the null hypothesis is false and the two algorithms are significantly different, with a probability of 95%.
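Roggo et al.^{15} give the exact formulation; one common continuity-corrected form of the statistic, built from the two discordant counts, can be sketched as follows (the per-sample results are illustrative, not the paper's data):

```python
import numpy as np

def mcnemar_chi2(correct_a, correct_b):
    """McNemar chi-squared with continuity correction for two classifiers
    judged on the same samples; only the discordant counts matter."""
    correct_a = np.asarray(correct_a)
    correct_b = np.asarray(correct_b)
    n01 = int(np.sum(correct_a & ~correct_b))   # A right, B wrong
    n10 = int(np.sum(~correct_a & correct_b))   # A wrong, B right
    if n01 + n10 == 0:
        return 0.0                              # identical behavior
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# Hypothetical per-sample results (True = correctly classified) for two models
a = np.array([True] * 18 + [False] * 2)
b = np.array([True] * 17 + [False] * 3)

chi2 = mcnemar_chi2(a, b)
# Compare against the critical value 3.84 (alpha = 0.05, 1 degree of freedom)
significantly_different = chi2 > 3.84
```

Samples on which both classifiers agree (both right or both wrong) drop out of the statistic, which is why the test suits paired comparisons on one validation set.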

**RESULTS AND DISCUSSION**

*4.1 Exploratory Analysis*

PCA of the NIR spectra shows separation of the samples into two non-overlapping groups, corresponding to *E. nitens* and *E. globulus*. The scores plot (Fig. 2) shows the best separation when MC and the first derivative were used, with 65% of the cumulative variance explained by the first three PC's. Use of the second derivative shows lower cumulative variance values, and when it is accompanied by *AS*, the species cannot be properly separated. The classification of the data into species by PCA supports the use of classification methods based on PC's; accordingly, the SIMCA and PLS-DA methods were performed.

When HCA was carried out, all the pre-treatments used separated the samples into two main clusters corresponding to the *Eucalyptus* species. However, only the first derivative shows a lower similarity index between the species and a greater similarity index between the samples within each species. The two main clusters showed 0.214 and 0.267 as maximal similarity index between the species for the AS-first derivative and MC-first derivative pre-treatments, respectively. The species classification by HCA supports the creation of a KNN model.

*4.2 Classification methods*

*a) KNN*

The optimal number of neighbors in *KNN* was 1 for all the algorithms used. The results showed that the first and second derivatives were good transformation techniques for classification of the modeling set. No significant differences were found between the transform and preprocessing techniques on the validation set, since all the methods correctly classified 100% of the samples.

*b) SIMCA*

All the algorithms utilized classified the data of the training set into two separated classes. The Cooman's plot (Fig. 3) shows that the MC - first derivative pre-treatment produces the best separation of the samples into two classes, placing all the samples of each class under their respective *S* values. Both preprocessing techniques accompanied by the second derivative showed all samples in the class-overlap quadrant, above the *S* values, indicating that the samples could belong to both classes, although no overlap between *E. globulus* and *E. nitens* samples was observed. Models using MC - first derivative also showed the best interclass distance and the greatest cumulative variance explained with the optimal number of PC's, calculated by cross validation (Table 1); however, models with the second derivative showed lower values of *S^{2}* than models with the first derivative, indicating higher compaction of the classes (Table 1).

It should be highlighted that no significant differences in prediction rate on the validation set were found using McNemar's test with the different transform and preprocessing techniques; thus, *MC* - first derivative and *AS* - first derivative correctly classified 95% of the samples, *MC* - second derivative 100%, and *AS* - second derivative 90%. SIMCA models for classification of *Eucalyptus* species previously reported, using wood as samples^{7}, showed a Cooman's plot with improperly separated classes, which shows the advantage of using leaves and of applying pre-treatments to the spectra.

*c) PLS-DA*

PLS-DA on the training set, with the optimum number of PC's determined by the lowest values of the standard error of calibration (SEC), the predicted residual error sum of squares (PRESS) and the percentage of cumulative variance for each class, also showed that *MC* - first derivative is the best model for classification (Table 2), although all the PLS-DA models correctly predicted 100% of the validation set samples.

*4.3. Comparison of classification models*

There are no significant differences in the prediction rate on the validation set between the algorithms used in independent studies within the SIMCA, KNN and PLS-DA models, or between the three studied models; however, all the methods show the best results for separation of the classes in the training set when the first derivative and MC were used. Although the prediction rate in the validation set is good for all the methods used, the comparison between the pre-treatments of the data in the training set shows the negative influence of applying the second derivative to our data. This agrees with the known increase in noise when this transform technique is applied. The SIMCA model is best able to show the separation of the classes and the influence of the pre-treatments on the training or prediction set, with a known significance level, and thus this method could be considered the best choice for subsequent classification of new samples.

It is important to mention that this work shows the use of *SPRM* not only for separation of the classes, as has been previously reported with the use of wood^{7,21}, but also the predictive ability of these methods for obtaining a fast, reliable and accurate classification of new samples in the forest industry. The models grouped the samples of *E. globulus* maintained under field and nursery conditions in the same class. Since there are no significant differences in the prediction rates on the validation set, the results of McNemar's test are not shown.

**CONCLUSIONS**

All the analyzed methods are able to classify samples of *Eucalyptus* species. There is no significant difference between the prediction abilities of the methods utilized; however, the SIMCA method is more adequate for differentiation of botanical species by NIR spectroscopy, because this method is not deterministic like KNN and PLS-DA: SIMCA gives a probability of belonging to one class or to both classes, and makes it possible to recognize when a sample does not belong to any of the classes studied. The best pre-treatment of the data was the application of mean-centering and the first derivative of the spectra.

NIR spectroscopy of leaves together with *SPRM* models could avoid the need for botanical classification by genetic or morphological methods, and the use of wood samples, for classification in forest applications.

**ACKNOWLEDGMENTS**

The authors thank the financial support of INNOVA Bío-Bío project N° 03-B1-210-L1 and DIUC 206.021.024-1.0. R. Castillo thanks DAAD for a PhD scholarship. The authors thank the collaboration of Prof. Dr. Matthias Otto.

**REFERENCES**

1. G. H. Chippendale, Myrtaceae - Eucalyptus - Angophora, Australian Govt. Publishing Service, Canberra, 1988.

2. L. D. Pryor and L. A. S. Johnson, A classification of the eucalypts, Australian National University Press, Canberra, 1971.

3. J. W. Turnbull, *New Forests. *17, 37, (1999).

4. C. A. Raymond, *Ann. Forest. Sci. *59, 525, (2002).

5. B. G. M. Vandeginste, D. L. Massart, L. M. C. Buydens, S. De Jong, P. J. Lewi and J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics: Part B, Elsevier, Amsterdam, 1998.

6. N. Gierlinger, M. Schwanninger and R. Wimmer, *J. Near Infrared. Spec. *12, 113, (2004).

7. A. J. Michell and L. R. Schimleck, *Appita *J. 51, 127, (1998).

8. Y. A. Woo, H.-J. Kim and J. Cho, *Microchem. J. *63, 61, (1999).

9. R. Fernandez-Torres, J. L. Perez-Bernal, M. A. Bello-López, M. Callejon-Mochon, J. C. Jiménez-Sánchez and A. Guiraum-Perez, *Talanta. *65, 686, (2005).

10. M. J. Saiz-Abajo, J. M. Gonzalez-Saiz and C. Pizarro, *J. Near Infrared. Spec. *12, 207, (2004).

11. M. Andre, *Anal. Chem. *75, 3460, (2003).

12. Q. S. Chen, J. W. Zhao, H. D. Zhang, M. H. Liu and M. Fang, *J. Near Infrared. Spec. *13, 327, (2005).

13. P. D. Jones, L. R. Schimleck, G. F. Peter, R. F. Daniels and A. Clark, *Wood Sci. Technol. *39, 529, (2005).

14. M. Tigabu and P. C. Oden, *New Forests. *25, 163, (2003).

15. Y. Roggo, L. Duponchel and J.-P. Huvenne, *Anal. Chim. Acta. *477, 187, (2003).

16. L. Xie, Y. Ying, T. Ying, H. Yu and X. Fu, *Anal. Chim. Acta. ***584, **379, (2007).

17. A. J. Michell, *Appita J. *47, 29, (1994).

18. J. B. Hauksson, G. Bergqvist, U. Bergsten, M. Sjostrom and U. Edlund, *Wood Sci. Technol. ***35, **475, (2001).

19. A. Terdwongworakul, V. Punsuwan, W. Thanapase and S. Tsuchikawa, *J. Wood Sci. *51, 167, (2005).

20. S. S. Kelley, T. G. Rials, L. R. Groom and C. L. So, *Holzforschung. ***58, **257, (2004).

21. M. Brunner, R. Eugster, E. Trenka and L. Bergamin-Strotz, *Holzforschung. ***50, **130, (1996).

22. L. R. Schimleck, A. J. Michell and C. A. Raymond, *Appita J. ***53, **318, (2000).

23. H. Baillères, F. Davrieux and F. Ham-Pichavant, *Ann. Forest. Sci. *59, 479, (2002).

24. J. T. Jolliffe, Principal Component Analysis, Springer, New York, 2002.

25. Infometrix, Inc., Pirouette Multivariate Data Analysis Software User Guide, Bothell, WA, 2003.

26. J. H. Ward, *J. Am. Stat. Assoc. ***58, **236, (1963).

27. R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, Wiley, New York, 2001.

28. P. K. Hopke, *Anal. Chim. Acta. ***500, **365, (2003).

29. M. P. Derde, L. Buydens, C. Guns, D. L. Massart and P. K. Hopke, *Anal. Chem. *59, 1868, (1987).

30. B. R. Kowalski and C. F. Bender, *Anal. Chem. ***44, **1405, (1972).

31. B. K. Alsberg, R. Goodacre, J. J. Rowland and D. B. Kell, *Anal. Chim. Acta. ***348, **389, (1997).

32. S. Wold and M. Sjostrom, in: B. R. Kowalski (Ed.), Chemometrics: Theory and Application, ACS Symposium Series, Washington, DC, 1977, p. 243.

33. M. Otto, Chemometrics. Statistics and computer application in analytical chemistry, Wiley-VCH, Weinheim, 2007.

34. D. Coomans and I. Broeckaert, Potential Pattern Recognition in Chemical and Medical Decision Making, John Wiley, Letchworth, 1986.

35. L. Ståhle and S. Wold, *J. Chemometr. *1, 185, (1987).

36. D. L. Massart, B. G. M. Vandeginste, L. M. C. Buydens, S. De Jong, P. J. Lewi and J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics: Part A, Elsevier, Amsterdam, 1997.

37. A. J. Burnham, R. Viveros and J. F. MacGregor, *J. Chemometr. *10, 31, (1996).

38. S. Masoum, D. J.-R. Bouveresse, J. Vercauteren, M. Jalali-Heravi and D. N. Rutledge, *Anal. Chim. Acta. ***558, **144, (2006).

39. Z. Ramadan, D. Jacobs, M. Grigorov and S. Kochhar, *Talanta. *68, 1683, (2006).

40. C. Y. Pierce, J. R. Barr, A. R. Woolfitt, H. Moura, E. I. Shaw, H. A. Thompson, R. F. Massung and F. M. Fernandez, *Anal. Chim. Acta. ***583, **23, (2007).

41. A. Savitzky and M. J. E. Golay, *Anal. Chem. *36, 1627, (1964).

42. H. Wold, in: P. R. Krishnaiah (Ed.), Multivariate Analysis, Academic Press, New York, 1966, p. 391.

43. Y. Miyashita, T. Itozawa, H. Katsumi and S.-I. Sasaki, *J. Chemometr. *4, (1990).

44. D. Coomans, I. Broeckaert, M. P. Derde, A. Tassin, D. L. Massart and S. Wold, *Comput. Biomed. Res. *17, 1, (1984).

(Received: January 24, 2008 - Accepted: July 20, 2008)

^{*} Corresponding author. Tel.: +56-41-2204601; Fax: +56-41-2245974. E-mail addresses: jfreer@udec.cl, dcontrer@udec.cl