Journal of the Chilean Chemical Society
On-line version ISSN 0717-9707
J. Chil. Chem. Soc. vol.52 no.3 Concepción Sept. 2007
J. Chil. Chem. Soc., 52, No. 3 (2007), pp. 1237-1239
VECTOR DOT PRODUCT USED TO REDUCE THE THREE INDEPENDENT VARIABLES OF A MULTIVARIATE REGRESSION TO A LINEAR REGRESSION WITH ONE INDEPENDENT VARIABLE. ALCOHOLS USED AS A MODEL.
1Departamento de Química Inorgánica y Analítica, Facultad de Ciencias Químicas y Farmacéuticas, Universidad de Chile, Casilla 233, Santiago, Chile.
The aim of this work is to reduce the independent variables of a multivariate regression analysis to a single variable by means of a vector dot product (E3). In this way, the orthogonalization procedure otherwise needed to obtain a valid regression equation, free of collinear variables and with valid signs on the coefficient of each independent variable, can be omitted. The E3 procedure also makes it possible to omit variable reduction by Principal Component Analysis (PCA) and the use of other calibration techniques in order to reach simple, valid regression functions. Reducing three independent variables to one by the E3 method permits the application of a linear regression (y = m*x + n) in which the parameters m and n have a clear meaning; this is not the case for the parameters of the original three-variable regression unless it is properly treated.
In QSPR multivariate regression equations, the real significance of the coefficients and signs affecting each independent variable is obtained only if an orthogonalization procedure [1] is carried out, or if the number of poorly significant independent variables is reduced by means of Principal Component Analysis (PCA) [2]. On the other hand, it is very important to consider the number of independent variables used in the regression: it must be in accordance with the number of cases treated; if not, the coefficient of determination (R²) is falsely high [3]. Another important aspect of multivariate regression analysis is the collinearity of the independent variables. This occurs when each independent variable is regressed in turn against the other variables and the resulting coefficients of determination (R²) exceed 0.900 [4].
Other multivariate calibration techniques are frequently applied in conjunction with PCA on multivariate functions. These techniques include multiple linear regression (MLR), used in this article, partial least-squares regression (PLS), continuum regression (CR), projection pursuit regression (PPR), locally weighted regression (LWR) and artificial neural networks (ANNs), among others. Each of these methods possesses its own strengths and weaknesses, and which works best for a given problem depends on the characteristics of the data and the objective of the analysis [5]. In quantitative structure-activity relationship (QSAR) studies, principal component analysis followed by sample selection to fit factorial and fractional factorial designs has been reported [6].
More extensive multivariate calibration methodology is not used in this paper because it is an introductory work that proposes a new idea with a small number of cases.
The reduction method presented in this work eliminates these difficulties by using a simple linear regression (y = m*E3 + n), where E3 is a function of three optimal variables chosen from a group of nine variables. E3 is obtained by a vector dot product. A similar reduction idea was proposed by the author with the V3 index [7], applied to saturated hydrocarbons, but the calculation used there to obtain the variable reduction is different and its statistical results are not as good for polar substances (alcohols).
The model used in this work consists of twenty-seven alcohols whose boiling points, used as the dependent variable, were taken from the literature [8]. For each of them, eight physicochemical parameters were chosen, and one well-known topological index, the electrotopological state index [9, 10] (E-Estate), was used. This reduction procedure requires a maximum of three independent variables per multivariate regression, in accordance with the number of cases treated [3]. Based on a combination procedure, forty-eight regressions were made, each with three independent variables and with the alcohol boiling point (Bp, °C) as the dependent variable. From these forty-eight multi-regression models, one was chosen as the best according to common statistical regression criteria. The structure of this model corresponds to equation 1.
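The triad search described above can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, the variable names `v0` to `v8` are hypothetical, adjusted R² is used as the single selection criterion, and all 84 possible triads of nine variables are enumerated because the rule that led to the forty-eight regressions of the paper is not given in the text.

```python
from itertools import combinations

import numpy as np


def adjusted_r2(X, y):
    """Least-squares fit of y on the columns of X plus an intercept; returns adjusted R²."""
    n, k = X.shape
    A = np.column_stack([X, np.ones(n)])          # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def best_triad(X, y, names):
    """Score every 3-variable subset and return (best adjusted R², variable names)."""
    scored = []
    for idx in combinations(range(X.shape[1]), 3):
        score = adjusted_r2(X[:, list(idx)], y)
        scored.append((score, tuple(names[i] for i in idx)))
    return max(scored)


# Synthetic stand-in for 27 compounds and 9 candidate descriptors (NOT the paper's data).
rng = np.random.default_rng(0)
names = [f"v{j}" for j in range(9)]
X = rng.normal(size=(27, 9))
y = 3.0 * X[:, 0] - 2.0 * X[:, 4] + 1.5 * X[:, 7] + rng.normal(scale=0.1, size=27)

score, triad = best_triad(X, y, names)
print(triad, round(score, 4))
```

In practice the choice would also weigh the other "common regression statistical criteria" the text mentions (P-values, residual analysis), which this sketch does not model.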
Here i denotes an alcohol, xi corresponds to the E-Estate topological index, yi to the octanol/water partition coefficient (log P), and zi to the molecular surface area S (Å²) [11]. The other physicochemical parameters [11] considered were molecular volume, density, refractive index, polarizability, dipole moment and hydration energy. None of them gave better results than the three mentioned above. All physicochemical values were obtained with the Hyperchem 7 program [11], and the E-Estate index was obtained with the Dragon software [10]. In this way, the triad of elements from the set of nine independent variables that gives the best multi-regression was established, and this relation was compared statistically against the linear regression (y = m*E3 + n) that results from the reduction procedure through the vector dot product (E3).
The E3 parameter was obtained by the following process:
The rows of the matrix Q were built from the triads of independent variables of each alcohol, corresponding to the physicochemical parameters used in the optimal multi-regression. To apply the reduction mechanism (E3), it was necessary to define a vector of three independent variables to be used as a comparative vector. Of twenty-seven comparative vectors, only the one holding the average (p) value of each parameter class produced the best results (an acceptable relation between calculated alcohol boiling points and E3). This was defined as the comparative vector [Xp, Yp, Zp], where the subscript p denotes an average value.
In the product E3 = [Xi, Yi, Zi] * [Xp, Yp, Zp], the subscript i denotes a particular alcohol. The result is a scalar that can be associated with any dependent variable, in this case the boiling points of the alcohols.
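A minimal sketch of this construction, assuming the comparative vector is the column-wise mean of the rows of Q. The two triads below are hypothetical numbers chosen for easy checking, not values from Table 1, and the function name `e3_values` is an illustrative choice.

```python
from statistics import fmean


def e3_values(rows):
    """Return (E3 per row, comparative vector) for a list of (x, y, z) triads."""
    # Comparative vector [Xp, Yp, Zp]: average of each column of Q.
    p = [fmean(col) for col in zip(*rows)]
    # E3_i = x_i*Xp + y_i*Yp + z_i*Zp (dot product of row i with p).
    e3 = [sum(a * b for a, b in zip(row, p)) for row in rows]
    return e3, p


# Hypothetical Q matrix with two compounds (NOT taken from the paper).
Q = [(1.0, 2.0, 3.0), (3.0, 4.0, 5.0)]
e3, p = e3_values(Q)
print(p)   # [2.0, 3.0, 4.0]
print(e3)  # [20.0, 38.0]
```

Each compound is thus mapped from three descriptors to one scalar, which can then serve as the single independent variable of a simple linear regression.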
PROCEDURE AND DISCUSSION
The twenty-seven alcohols are characterized by three optimal independent variables, E-Estate, log P and molecular surface area S (Å²), with the boiling point (Bp, °C) as the dependent variable; see Table 1. The particular form of equation 1 was obtained with the Statgraphics program [12], corroborated with the Origin 7 program [13] and by the theory of linear algebra applied to multi-regression [14]. This is equation 3.
Since the P-value in the ANOVA analysis is less than 0.01, there is a statistically significant relationship between the variables at the 99% confidence level. The R-squared statistic indicates that the model as fitted explains 92.52% of the variability in boiling point. The adjusted R-squared statistic, which is more suitable for comparing models with different numbers of independent variables, is 91.5%.
The mean absolute error (MAE) is 4.51; it is the average magnitude of the residuals.
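The adjusted R-squared quoted above can be checked from the reported R-squared with the standard formula 1 - (1 - R²)(N - 1)/(N - k - 1), taking N = 27 alcohols and k = 3 independent variables from the text. The function name below is an illustrative choice.

```python
def adjusted_r2(r2, n_cases, n_predictors):
    """Standard adjusted R² for a regression with an intercept."""
    return 1.0 - (1.0 - r2) * (n_cases - 1) / (n_cases - n_predictors - 1)


adj = adjusted_r2(0.9252, 27, 3)
print(f"{100 * adj:.1f} %")  # 91.5 %
```

The result agrees with the 91.5% reported for the three-variable model.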
The study of collinearity [4] (R² > 0.90) gives the following relations:
One way to check for multicollinearity is to regress each independent variable in turn against all the other predictors and to examine the resulting R² values. If a value goes above 90.0%, multicollinearity is said to be a problem, and it is necessary to orthogonalize the system or to use the PCA method.
This result indicates collinearity between the independent variables. The model can in part be simplified because the P-value of log P in the regression is 0.1867. Since this P-value is greater than or equal to 0.10, this variable is not statistically significant at the 90% or higher confidence level. Consequently, it would be possible to consider removing log P from the model, which is not done in this study.
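The collinearity check described above (regress each predictor on the remaining ones and flag R² above 0.90) can be sketched as follows. The data are synthetic and built so that one column is nearly a linear combination of the other two; they are an assumption for illustration and do not reproduce the E-Estate, log P and S values of the paper.

```python
import numpy as np


def r2_against_others(X):
    """For each column j, R² of the regression of column j on all other columns."""
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([others, np.ones(n)])      # predictors plus intercept
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        ss_tot = float(((target - target.mean()) ** 2).sum())
        out.append(1.0 - float(resid @ resid) / ss_tot)
    return out


rng = np.random.default_rng(1)
a = rng.normal(size=27)
b = rng.normal(size=27)
c = a + b + rng.normal(scale=0.05, size=27)   # nearly collinear with a and b
collinear = r2_against_others(np.column_stack([a, b, c]))
independent = r2_against_others(rng.normal(size=(27, 3)))

print([v > 0.90 for v in collinear])     # every predictor is flagged
print([v > 0.90 for v in independent])   # no predictor is flagged
```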
Table 1, columns 10 and 11, presents the boiling points calculated from the linear equation (y = m*E3 + n) and the residuals between experimental and calculated boiling points.
The mean absolute error (MAE) is 6.30; it is the average magnitude of the residuals.
Table 1, columns 8 and 9, presents the boiling points calculated from the multivariate regression and the residuals between experimental and calculated boiling points.
Table 1, column 7, presents the vector dot product values, E3 = [Xi, Yi, Zi] * [Xp, Yp, Zp], where p indicates the average value of each column (4, 5 and 6). The specific equation corresponding to this proposition is y = m*E3 + n, named equation 4.
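The fit of equation 4 and its mean absolute error can be sketched with ordinary least squares for a single predictor. The E3 and boiling point values below are hypothetical and chosen for easy hand checking; they are not taken from Table 1, and the function names are illustrative.

```python
from statistics import fmean


def fit_line(x, y):
    """Ordinary least squares for y = m*x + n; returns (m, n)."""
    xm, ym = fmean(x), fmean(y)
    sxx = sum((xi - xm) ** 2 for xi in x)
    sxy = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    m = sxy / sxx
    return m, ym - m * xm


def mean_absolute_error(y, y_calc):
    """Average magnitude of the residuals."""
    return fmean(abs(a - b) for a, b in zip(y, y_calc))


# Hypothetical values (NOT the paper's data).
e3 = [10.0, 20.0, 30.0, 40.0]
bp = [66.0, 84.0, 106.0, 124.0]

m, n = fit_line(e3, bp)
bp_calc = [m * v + n for v in e3]
mae = mean_absolute_error(bp, bp_calc)
print(round(m, 2), round(n, 2), round(mae, 2))  # 1.96 46.0 0.8
```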
The standard errors of the coefficients of the multivariable regression are more significant than the standard errors of the n and m coefficients of the proposed model; see the P-values in Table 2 and Table 3. The negative sign of the E-Estate coefficient has no physicochemical significance, because the derivative of boiling point with respect to E-Estate is positive (+3.59), in accordance with the following relation: a greater E-Estate value corresponds to a greater boiling point and consequently to a greater molecular weight. For both sets of differences (Table 1, columns 9 and 11), the standardized skewness and standardized kurtosis are within the range of -2 to +2, which validates the following statistical parameters. An analysis of the differences between experimental and calculated boiling points for both regression models (columns 9 and 11) using the Statgraphics software [12] indicated that there are no statistically significant differences between the means, standard deviations, medians and distributions (Kolmogorov-Smirnov test) at the 95.0% confidence level. In fact, the coefficients and signs of the multivariate regression have no physical basis; the regression can only be used as a model to obtain calculated values of the dependent variable, with a spurious interpretation of the coefficients of the independent variables, and in many cases the signs of the coefficients are wrong. For this reason it is necessary to apply an orthogonalization method to the multivariable regression, or to use the method described in this paper, in order to obtain a model consistent with a physicochemical interpretation.
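The -2 to +2 screening on standardized skewness and kurtosis mentioned above can be sketched as below. This is an approximation under stated assumptions: simple moment-based estimators are divided by the large-sample standard errors sqrt(6/N) and sqrt(24/N); the exact estimators used by the statistics package cited in the paper may differ slightly. The residuals are hypothetical.

```python
from math import sqrt
from statistics import fmean


def standardized_shape(residuals):
    """Return (standardized skewness, standardized excess kurtosis)."""
    n = len(residuals)
    mean = fmean(residuals)
    m2 = fmean((r - mean) ** 2 for r in residuals)   # second central moment
    m3 = fmean((r - mean) ** 3 for r in residuals)   # third central moment
    m4 = fmean((r - mean) ** 4 for r in residuals)   # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    return skew / sqrt(6.0 / n), excess_kurt / sqrt(24.0 / n)


# Hypothetical residuals (NOT columns 9 or 11 of Table 1).
residuals = [-3.0, -1.0, 0.0, 1.0, 3.0]
z_skew, z_kurt = standardized_shape(residuals)
within_range = abs(z_skew) <= 2 and abs(z_kurt) <= 2
print(round(z_skew, 3), round(z_kurt, 3), within_range)
```

Values inside the range are read as no significant departure from normality, which is what justifies the subsequent comparison of means and standard deviations.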
DISCUSSION AND CONCLUSIONS
Both models give similar differences between experimental and calculated boiling points, but the multivariate regression model does not clearly define the signs and magnitudes of the coefficients affecting each independent variable. The model proposed in this paper is easy to obtain, and its positive slope is in accordance with the positive slopes of the following derivatives: d Bp °C/d E-Estate, d Bp °C/d log P, d Bp °C/d(S).
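The sign argument can be made explicit with a short derivation. It assumes that the comparative vector [Xp, Yp, Zp] is held fixed once computed and writes xi, yi, zi for the E-Estate, log P and S values of alcohol i; that the three averages are positive is an assumption here, since the values of Table 1 are not reproduced in this text.

```latex
\begin{align}
Bp_i &= m\,\mathrm{E3}_i + n, \qquad
\mathrm{E3}_i = x_i X_p + y_i Y_p + z_i Z_p \\
\frac{\partial Bp_i}{\partial x_i} &= m\,X_p, \qquad
\frac{\partial Bp_i}{\partial y_i} = m\,Y_p, \qquad
\frac{\partial Bp_i}{\partial z_i} = m\,Z_p
\end{align}
```

With m > 0 and positive averages, all three partial derivatives are positive, so the single slope m cannot assign a sign to one descriptor that contradicts its individual trend, unlike the separately fitted coefficients of the three-variable regression.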
1. M. Randic. J. Chem. Inform. Comput. Sci. 37, 672 (1997).
2. R. C. Graham. "Data Analysis of the Chemical Sciences. A Guide to Statistical Techniques". U.C. Publisher. Inc (1993), pp. 329-346.
3. J. C. Toplis, R. P. Edwards. J. Med. Chem. 22, 1238 (1979).
4. Collinearity. http://18.104.22.168/multivar/mr.htm
5. P. D. Wentzell, D. T. Andrews. Anal. Chem. 69, 2299 (1997).
6. J. Ferré, F. X. Rius. Anal. Chem. 68, 1565 (1996).
7. E. Cornwell. J. Chil. Chem. Soc. 51(1), 765 (2006).
8. D. R. Lide, H. P. R. Frederikse. "CRC Handbook of Chemistry and Physics", 75th Edition. CRC Press, Inc., 1995.
9. L. H. Hall, L. B. Kier. J. Chem. Inf. Comput. Sci. 40, 784 (2000).
11. Hyperchem. Release 7.01 for Windows, "Molecular Model System" (Evaluation Copy). Copyright 2002 Hypercube, Inc.
12. Statgraphic Plus 5.1. Copyright 1994-2001 Statistically Graphic Corp.
13. Origin 73R1 V7. 0301 (B30019). Copyright © 1991-2002 Origin Lab. Corporation. One Round Plaza, Northampton, MA 01060, USA.
14. D. L. Massart, B.G.M. Vadegisnte, S.N.N. Deming, Y. Machotte, L. Kaufman. "Chemometrics: a textbook". Elsevier Scientific Publishing Company, Amsterdam, 1998.
(Received March 2007 - Accepted 1 June 2007)
Corresponding author: e-mail: email@example.com