Journal of the Chilean Chemical Society
ISSN 0717-9707
J. Chil. Chem. Soc. vol.56 no.3 Concepción 2011
J. Chil. Chem. Soc., 56, No. 3 (2011), pp. 746-751.
SUPPORT VECTOR MACHINE REGRESSION FOR REACTIVITY PARAMETERS OF VINYL MONOMERS
XINLIANG YU*,1, XUEYE WANG2 AND JIANFANG CHEN1
1 College of Chemistry and Chemical Engineering, Hunan Institute of Engineering, Xiangtan, Hunan 411104, China. e-mail: email@example.com
2 Key Laboratory of Environmentally Friendly Chemistry and Applications of Ministry of Education, College of Chemistry, Xiangtan University, Xiangtan, Hunan 411105, China.
ABSTRACT
Recently, the support vector machine (SVM), a novel type of learning machine, has been introduced to solve chemical problems. In this study, ε-support vector regression (ε-SVR) and ν-support vector regression (ν-SVR) were used, respectively, to construct quantitative structure-property relationship (QSPR) models of the Q and e parameters of the Q-e scheme, which is remarkably useful in interpreting the reactivity of a monomer in free-radical copolymerizations. The quantum chemical descriptors used to develop the SVR models were calculated from styrene and from radicals with the structure CH3CH2C1H2-C2HR3· (C1H2=C2HR3 + CH3CH2· → CH3CH2C1H2-C2HR3·). The optimum ε-SVR model of lnQ (C = 9, ε = 0.05 and γ = 0.2) and the optimum ν-SVR model of e (C = 100, ν = 0.5 and γ = 0.4) produced low root mean square (rms) errors for the prediction sets: 0.318 and 0.266, respectively. Thus, SVR can be applied successfully to predict the parameters Q and e.
Keywords: free-radical copolymerizations; Q-e scheme; quantum chemical descriptors; structure-property relations; support vector machine.
INTRODUCTION
Quantitative structure-property relationship (QSPR) studies for predicting the chemical and physical properties of molecules are unquestionably important in modern chemistry, 1 especially in cases where reliable experimental data are difficult to obtain. Usually, a satisfactory QSPR model can serve as a guide to chemists, because it can be used to select molecules (including those not yet synthesized) with the desired properties. Thus, the QSPR approach conserves resources and accelerates the development of new molecules for any purpose. 1
The Q-e scheme, the most widely used general reactivity scheme, is remarkably useful in interpreting the reactivity of a monomer in free-radical copolymerizations. 2, 3 In the scheme, the parameter Q measures the general reactivity of a monomer (or a radical) and its energetic (i.e. thermodynamic) properties; the parameter e measures the polar properties of a monomer (or a radical), i.e. the supposed permanent electric charge that results in mutual attraction or repulsion between the two monomers (or radicals). 4 Many researchers have predicted the reactivity parameters Q and e with QSPR approaches based on the multiple linear regression (MLR) technique and/or artificial neural network (ANN) approaches. 2, 5-8 In fact, Q and e values depend on the reference monomer, 2, 9-11 whereas the descriptors used in these QSPR models did not include information on the reference monomer. In the present work, support vector machine (SVM) models are developed to predict the Q and e values with quantum chemical descriptors obtained from the radicals CH3CH2C1H2-C2HR3· (C1H2=C2HR3 + CH3CH2· → CH3CH2C1H2-C2HR3·) and from the reference monomer styrene.
MATERIALS AND METHODS
Tables 1 and 2 list 60 vinyl monomers and their respective experimental lnQ and e values. These lnQ and e values are based on the reference monomer styrene (Q = 1.0 and e = -0.8) and were taken from the literature. 12 The monomers comprise a variety of substituent groups, such as esters, ethers, sulfides, halides, ketones, acids, amides, and aromatic and nonaromatic rings. The data were randomly split into three sets (in the ratio 50%, 25% and 25%): a training set (30 monomers), a validation set (15 monomers) and a test set (15 monomers). The training set was used to train the SVM models, the validation set was used to optimize the parameters of the SVM models, and the test set (referred to below as the prediction set) was used to evaluate their predictive ability.
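The 50/25/25 random split described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `split_dataset`, the fixed seed and the integer monomer labels are assumptions introduced here, and the paper does not state how the random assignment was generated.

```python
import random


def split_dataset(items, n_train=30, n_valid=15, seed=0):
    """Shuffle the items and cut them into training, validation and test subsets.

    The seed is a hypothetical choice that only makes the sketch reproducible.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Placeholder labels standing in for the 60 monomers of Tables 1 and 2.
monomer_ids = list(range(1, 61))
train, valid, test = split_dataset(monomer_ids)
print(len(train), len(valid), len(test))
```

Each monomer ends up in exactly one subset, so the prediction (test) set never influences training or parameter selection.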
a Units: Eβg in Hartree (1 Hartree = 2.6255 × 10^6 J/mol); DMC2 and qDMR3 in electrons (1 electron = 1.602188 × 10^-19 C).
a Units: Eαg and eg in Hartree (1 Hartree = 2.6255 × 10^6 J/mol).
Previous work has found that atomic charges and frontier molecular orbital energies are related to lnQ and e. 2, 6-9 Thus, these descriptors were calculated for the radicals CH3CH2C1H2-C2HR3· with density functional theory (DFT) in the Gaussian 03 program, 13 at the UB3LYP level of theory with the 6-31G(d) basis set. Calculations were also performed on styrene using the same methods. The descriptors calculated from the radicals include the energies of the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO) of the alpha spin states (Eαhomo and Eαlumo), the energies of the HOMO and LUMO of the beta spin states (Eβhomo and Eβlumo), the energy gap between the HOMO and LUMO of the alpha spin states (Eαg), the energy gap between the HOMO and LUMO of the beta spin states (Eβg), the Mulliken charge of R3 (qmr3), and the Mulliken atomic spin density on C2 (Dmc2). The descriptors for styrene were the Mulliken atomic charge of R3 (qsmr3) and the HOMO energy (Eshomo). In addition, we defined two descriptors, qdmr3 and eg. The former is the absolute value of the Mulliken charge difference on R3 between the radical and styrene, i.e., qdmr3 = abs(qmr3 - qsmr3). The latter is the absolute value of the difference between the LUMO energy of the beta spin states (Eβlumo) of the radical and the HOMO energy of styrene, i.e., eg = abs(Eβlumo - Eshomo). A total of 12 descriptors were calculated.
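The two derived descriptors defined above are simple absolute differences, which can be sketched as follows. The function names and the numerical inputs are hypothetical and serve only as illustration; they are not values from Tables 1 and 2. The conversion factor is the one quoted in the table footnotes.

```python
HARTREE_TO_J_PER_MOL = 2.6255e6  # conversion factor quoted in the table footnotes


def charge_difference(q_mr3, q_smr3):
    """qdmr3: absolute Mulliken charge difference on R3 (radical vs. styrene)."""
    return abs(q_mr3 - q_smr3)


def energy_difference(e_beta_lumo, e_s_homo):
    """eg: absolute difference between the beta-spin LUMO energy of the radical
    and the HOMO energy of styrene (both in Hartree)."""
    return abs(e_beta_lumo - e_s_homo)


# Illustrative inputs only (assumed numbers, not data from the paper).
q_dmr3 = charge_difference(-0.10, 0.05)
e_g = energy_difference(-0.02, -0.22)
e_g_j_per_mol = e_g * HARTREE_TO_J_PER_MOL
print(round(q_dmr3, 3), round(e_g, 3), round(e_g_j_per_mol))
```

Because both descriptors are built from a radical property and a styrene property, they carry information about the reference monomer into the model.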
Support vector machine
Support vector machines (SVM) 14-25 are a powerful, state-of-the-art family of data mining algorithms for nonlinear input-output knowledge discovery. In SVM regression, the idea is to map the input data into a high-dimensional feature space and subsequently carry out linear regression in that feature space. 14, 15 Thus, the input-output pairs of training data of size n
where γ is a parameter to be optimized. γ controls the width of the Gaussian function and, therefore, the generalization ability of the SVM. The radial basis function (RBF) is one of the most commonly used kernel functions in the SVR technique and has been used widely in SVM. 15, 16
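As an illustration of the role of γ, the sketch below evaluates a Gaussian RBF kernel for two descriptor vectors. It assumes the common convention K(x, z) = exp(-γ‖x - z‖²); the exact form used in the paper may differ, and the descriptor vectors are made-up numbers.

```python
import math


def rbf_kernel(x, z, gamma):
    """Gaussian RBF kernel, assuming the convention exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


# Two hypothetical three-descriptor vectors (illustrative values only).
x = [0.10, 0.80, 0.15]
z = [0.30, 0.60, 0.05]

for gamma in (0.1, 0.2, 0.4):
    print(gamma, rbf_kernel(x, z, gamma))
```

Under this convention a larger γ makes the kernel decay faster with distance, so each training point influences a narrower neighbourhood and the fitted function becomes more flexible.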
Besides the parameter γ, the parameters C and ε also need to be adjusted by the user when an ε-SVR model is trained. The parameter C, the penalty factor, controls the trade-off between the errors of the SVM on the training data and the model complexity. The parameter ε controls the width of the ε-insensitive zone and thereby influences the complexity and the generalization capability of the model.
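The ε-insensitive zone can be made concrete with the standard ε-insensitive loss, in which residuals smaller than ε cost nothing and larger residuals are penalized linearly. The sketch below is an illustration under that standard definition; the function name and the sample values are assumptions.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Standard epsilon-insensitive loss: zero inside the tube, linear outside."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Illustrative residuals evaluated with epsilon = 0.05.
inside_tube = epsilon_insensitive_loss(1.00, 1.03, 0.05)
outside_tube = epsilon_insensitive_loss(1.00, 1.25, 0.05)
print(inside_tube, round(outside_tube, 3))
```

A wider tube (larger ε) leaves more training points with zero loss, which typically means fewer support vectors and a smoother model.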
The tube parameter ε is difficult to select, as one does not know beforehand how accurately the function will fit. The ν-SVR was developed to adjust the tube size ε automatically by using a parameter ν. In the ν-SVR, the parameter ν replaces the parameter ε of the ε-SVR and is used to control the number of support vectors. 24, 25 Similar to Eq. 3, the primal form is
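For orientation, the ν-SVR primal problem in the standard formulation of Schölkopf et al. (ref. 24) can be sketched as follows. The symbols w, b, φ, ξ_i and ξ_i* are conventional notation assumed here and may differ from the notation of the authors' own equations.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^{*},\,\varepsilon}\quad
  & \frac{1}{2}\lVert w\rVert^{2}
    + C\left(\nu\varepsilon + \frac{1}{n}\sum_{i=1}^{n}\bigl(\xi_i + \xi_i^{*}\bigr)\right) \\
\text{subject to}\quad
  & \bigl(w\cdot\phi(x_i) + b\bigr) - y_i \le \varepsilon + \xi_i, \\
  & y_i - \bigl(w\cdot\phi(x_i) + b\bigr) \le \varepsilon + \xi_i^{*}, \\
  & \xi_i,\ \xi_i^{*} \ge 0,\quad \varepsilon \ge 0,\quad i = 1,\dots,n .
\end{aligned}
```

In this formulation ε is itself an optimization variable, and ν (between 0 and 1) acts as an upper bound on the fraction of training points lying outside the tube and a lower bound on the fraction of support vectors.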
All SVM models in the present paper were obtained with winSVM, which is freely available for download (http://www.cs.ucl.ac.uk/staff/M.Sewell/winsvm/). Stepwise multiple linear regression (MLR) was used to select an optimum subset of descriptors and to develop an MLR model. These descriptors were then used to develop the SVM models.
RESULTS AND DISCUSSION
By correlating the 12 descriptors with the reactivity parameters lnQ and e in the training sets using stepwise MLR, the optimal MLR models of lnQ and e were obtained. The optimum subset of descriptors for lnQ comprises three descriptors: Eβg (the energy gap between the HOMO and LUMO of the beta spin states), Dmc2 (the Mulliken atomic spin density on C2), and qdmr3 (the absolute value of the Mulliken charge difference on R3 between the radical and styrene). The statistical parameters of the MLR model are the following:
where N is the number of monomers used, R is the correlation coefficient, se is the standard error of estimation, and F is the Fisher ratio. The root mean square (rms) errors for the training, validation and prediction sets are 0.555, 0.578 and 0.496, respectively.
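The rms errors quoted throughout can be computed as sketched below. The sketch assumes the usual definition, the square root of the mean squared residual with division by the number of samples; the paper does not state its exact formula, and the sample values are illustrative.

```python
import math


def rms_error(observed, predicted):
    """Root mean square error, assuming division by the number of samples."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative lnQ-like values (assumed numbers, not data from Table 1).
observed = [-1.20, 0.00, 0.45, -3.10]
predicted = [-1.00, 0.10, 0.15, -3.40]
print(round(rms_error(observed, predicted), 3))
```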
From the viewpoint of the frontier molecular orbital (FMO) theory of chemical reactivity, the formation of a transition state is due to the interaction between the HOMO (electron-rich component) and the LUMO (electron-deficient component). Thus, the FMOs are treated separately from the other orbitals and have become very popular quantum chemical descriptors. In general, the HOMO energy describes the ionization potential, while the LUMO energy reflects the electron affinity. 1 Moreover, both the HOMO and the LUMO energies are important in radical reactions. 26 Eg (the energy gap between the HOMO and LUMO) is an important stability index. For example, a large Eg value implies high stability of the molecule in the sense of lower reactivity in chemical reactions. 27, 28 The parameter Q measures the resonance stabilization, i.e., a monomer that forms free radicals easily possesses a large lnQ value. Thus, it is easy to understand that a radical with a small Eβg value has high reactivity and a large lnQ value.
In fact, all chemical interactions arise from electrostatic (or orbital) effects, which depend on atomic charges. 1 Atomic charge descriptors can therefore reflect molecular chemical reactivity (or intermolecular interactions). A large Dmc2 or qdmr3 implies that the monomer (or the radical) possesses a small resonance stabilization and has a small lnQ value.
The optimum MLR model of the reactivity parameter e includes two descriptors, Eαg and eg. The statistical parameters for e are the following:
The rms errors for the training, validation and prediction sets are 0.395, 0.337 and 0.343, respectively. The parameter e is a measure of the polarity of a monomer (or a radical). In Eq. 15, the two descriptors Eαg and eg are related to the FMO energies of the radicals. Usually, the FMO energies (or the energy gap Eg between the HOMO and LUMO) are correlated with the polarization of a molecule. 1 Thus, Eαg and eg are also related to the reactivity parameter e.
The program winSVM was used to develop SVM models for lnQ and e. To obtain a satisfactory model, the SVM parameters C, ε (or ν) and γ need to be selected properly. The training of the SVM models of lnQ is taken here as an example. First, the training set of lnQ was selected as the input file and optimized 100 times, and the output results were inspected. Learning parameters of C = 10, ε = 0.01 and γ = 0.2 produced a low mean squared error. These SVM parameters were therefore applied to the validation set and optimized further, one parameter at a time. By training the SVM models of lnQ with γ values of 0.1, 0.15, 0.2, 0.25, 0.3 and 0.4 (C = 10, ε = 0.01), validation set rms errors of 0.384, 0.353, 0.332, 0.353, 0.372 and 0.443, respectively, were obtained. Thus, the optimal γ, corresponding to the minimal rms error, was set to 0.2. Subsequently, with γ = 0.2 and C = 10, the parameter ε was optimized with ε being 0.005, 0.01, 0.03, 0.05, 0.07 and 0.08, respectively. The validation set rms errors based on different ε are 0.334, 0.332, 0.323, 0.319, 0.320, 0.330 and 0.327, respectively, so the optimal ε was fixed at 0.05. Similarly, the last parameter C was optimized with ε = 0.05 and γ = 0.2. The validation set rms errors are 0.337, 0.320, 0.318, 0.319, 0.322 and 0.326, respectively, for C values of 5, 8, 9, 10, 11 and 12. Thus, the optimal C was 9. Lastly, the optimum ε-SVR model of lnQ with the RBF kernel (C = 9, ε = 0.05 and γ = 0.2) was tested on the prediction set. The rms errors for the training, validation and prediction sets are 0.343, 0.330 and 0.317, respectively. The lnQ values calculated with the ε-SVR model are listed in Table 1 and depicted in Figure 1.
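The one-parameter-at-a-time selection described above amounts to choosing, at each step, the candidate value with the lowest validation rms error. The sketch below replays the γ and C steps with the numbers reported in the text. The helper `best_setting` is a hypothetical name, and the ε step is left out because the lists of ε values and errors given in the text differ in length.

```python
def best_setting(candidates, validation_rms):
    """Return the (value, rms) pair with the lowest validation rms error."""
    if len(candidates) != len(validation_rms):
        raise ValueError("each candidate needs exactly one rms error")
    return min(zip(candidates, validation_rms), key=lambda pair: pair[1])


# gamma step (C = 10, epsilon = 0.01), values as reported in the text.
gamma_grid = [0.1, 0.15, 0.2, 0.25, 0.3, 0.4]
gamma_rms = [0.384, 0.353, 0.332, 0.353, 0.372, 0.443]

# C step (epsilon = 0.05, gamma = 0.2), values as reported in the text.
c_grid = [5, 8, 9, 10, 11, 12]
c_rms = [0.337, 0.320, 0.318, 0.319, 0.322, 0.326]

best_gamma, best_gamma_rms = best_setting(gamma_grid, gamma_rms)
best_c, best_c_rms = best_setting(c_grid, c_rms)
print(best_gamma, best_c)  # prints: 0.2 9
```

Tuning one parameter while holding the others fixed is cheaper than a full grid search, but it can miss combinations in which the parameters interact.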
The SVM parameters for e were tuned with the same method. Learning parameters of C = 100, ν = 0.6 and γ = 0.6 were obtained after the initial optimization. Then different values of γ (0.3, 0.4, 0.5, 0.6, 0.7 and 0.8), ν (0.2, 0.3, 0.4, 0.5, 0.6 and 0.7) and C (10, 50, 90, 100, 110 and 130) were tested. In the end, the optimal SVM parameters (γ = 0.4, ν = 0.5 and C = 100) were obtained. The optimal ν-SVR model produced rms errors of 0.257 for the training set, 0.264 for the validation set and 0.266 for the prediction set. The e values calculated with the ν-SVR model are listed in Table 2 and depicted in Figure 2.
The rms errors for the prediction sets of the lnQ and e models based on the ANN approach were 0.313 and 0.271, respectively, which are close to the results obtained with the SVM approach. However, the ratio (30/30) of fitted samples (15 + 15 = 30) to training samples (30) in this paper is larger than that (16/40) in the previous model. This means that the present SVM models have better statistical quality and generalization capability. For the previous ANN models of vinyl monomers, the training set rms errors were 0.581 for lnQ and 0.234 for e. 6 In addition, in the ANN models of acrylate monomers, the training set rms errors were 0.302 for lnQ and 0.127 for e. 7 In fact, as long as the correlation coefficient R between the experimental and calculated e values is greater than 0.876, a good fit has been achieved. 2 The R values of the ν-SVR model for e in this paper are 0.971 for the training set, 0.943 for the validation set and 0.960 for the prediction set, all of which are larger than 0.876. This illustrates that our results for the e model are satisfactory and acceptable. In comparison with previous models, the present SVM models show satisfactory statistical results, although the number of samples used in this article is much greater than that in previous models. 2, 5-7
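The comparison against the 0.876 threshold can be sketched as follows, assuming that R denotes the Pearson correlation coefficient between experimental and calculated values (the paper does not spell out the formula). The function name and the small data set are illustrative; only the three R values and the threshold come from the text.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Illustrative experimental vs. calculated e values (assumed numbers).
e_experimental = [-0.80, 0.40, 1.20, -1.50, 0.60]
e_calculated = [-0.65, 0.55, 1.00, -1.30, 0.45]
print(round(pearson_r(e_experimental, e_calculated), 3))

# R values reported in the text for the training, validation and prediction sets.
reported_r = [0.971, 0.943, 0.960]
print(all(r > 0.876 for r in reported_r))  # prints: True
```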
CONCLUSIONS
Two SVR models were developed to predict the reactivity parameters Q and e, respectively, of vinyl monomers in radical copolymerization. Compared with existing models, the SVR models show good statistical characteristics. We draw the following conclusions:
1) To develop the SVR models, it is feasible to calculate quantum chemical descriptors from styrene and from radicals with the structure CH3CH2C1H2-C2HR3· formed from C1H2=C2HR3 and CH3CH2·.
2) The SVR models describing the non-linear correlation between the quantum chemical descriptors and the reactivity parameters Q and e are accurate and acceptable.
3) Atomic charge and spin density descriptors (Dmc2 and qdmr3) and frontier molecular orbital energies (Eβg, Eαg and eg) are the most important factors in predicting monomer (or radical) reactivity based on the reference monomer styrene (Q = 1.0 and e = -0.8).
ACKNOWLEDGEMENTS
We gratefully acknowledge financial support from the Open Project Program of Key Laboratory of Environmentally Friendly Chemistry and Applications of Ministry of Education, China (Grant No. 10HJYH06), the Scientific Research Fund of Hunan Provincial Education Department (Grant No. 09A019), and the National Natural Science Foundation of China (Grant No. 20972045).
REFERENCES
1. M. Karelson, V. S. Lobanov, A. R. Katritzky, Chem. Rev. 96, 1027, (1996).
2. C. G. Zhan, D. A. Dixon, J. Phys. Chem. A 106, 10311, (2002).
3. A. D. Jenkins, J. Polym. Sci. A: Polym. Chem. 37, 113, (1999).
4. T. Alfrey, C. C. Price, J. Polym. Sci. 2, 101, (1947).
5. A. A. Toropov, V. O. Kudyshkin, N. L. Voropaeva, I. N. Ruban, S. Sh. Rashidova, J. Struct. Chem. 45, 945, (2004).
6. X. L. Yu, W. Q. Liu, F. Liu, X. Y. Wang, J. Mol. Model. 14, 1065, (2008).
7. X. L. Yu, B. Yi, X. Y. Wang, Eur. Polym. J. 44, 3997, (2008).
8. X. L. Yu, X. Y. Wang, B. Li, Colloid. Polym. Sci. 288, 951, (2010).
9. S. C. Rogers, W. C. Mackrodt, T. P. Davis, Polymer 35, 1258, (1994).
10. G. C. Laurier, K. F. O'Driscoll, P. M. Reilly, J. Polym. Sci.: Polym. Symp. Ed. 72, 17, (1985).
11. N. Kawabata, T. Tsuruta, J. Furukawa, Makromol. Chem. 51, 70, (1962).
12. J. Brandrup, E. H. Immergut, E. A. Grulke, Polymer Handbook, 4th ed. Wiley, New York, 1999.
13. M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, V. G. Zakrzewski, J. A. Montgomery, R. E. Stratmann, J. C. Burant, S. Dapprich, J. M. Millam, A. D. Daniels, K. N. Kudin, M. C. Strain, O. Farkas, J. Tomasi, V. Barone, M. Cossi, R. Cammi, B. Mennucci, C. Pomelli, C. Adamo, S. Clifford, J. Ochterski, G. A. Petersson, P. Y. Ayala, Q. Cui, K. Morokuma, D. K. Malick, A. D. Rabuck, K. Raghavachari, J. B. Foresman, J. Cioslowski, J. V. Ortiz, B. B. Stefanov, G. Liu, A. Liashenko, P. Piskorz, I. Komaromi, R. Gomperts, R. L. Martin, D. J. Fox, T. Keith, M. A. Al-Laham, Peng, W. Chen, M. W. Wong, J. L. Andres, M. Head-Gordon, E. S. Replogle, and J. A. Pople, Gaussian 03, Revision B.05. Gaussian Inc., Pittsburgh, PA, 2003.
14. V. Vapnik, The Nature of Statistical Learning Theory. Springer, New York, 1995.
15. V. N. Vapnik, Statistical Learning Theory. Wiley, New York, 1998.
16. G. Camps-Valls, A. M. Chalk, A. J. Serrano-López, J. D. Martín-Guerrero, E. L. Sonnhammer, BMC Bioinformatics 5, 135, (2004).
17. R. Kumar, A. Kulkarni, V. K. Jayaraman, B. D. Kulkarni, Internet Electron. J. Mol. Des. 3, 118, (2004).
18. C. J. C. Burges, Data Min. Knowl. Disc. 2, 121, (1998).
19. F. Luan, R. Zhang, X. Yao, M. Liu, Z. Hu, B. Fan, QSAR Comb. Sci. 24, 227, (2005).
20. H. D. Li, Y. Z. Liang, Q. S. Xu, Chemomet. Intell. Lab. Syst. 95, 188, (2009).
21. O. Ivanciuc, Internet Electron. J. Mol. Des. 3, 802, (2004).
22. J. Bi, K. P. Bennett, M. Embrechts, C. M. Breneman, M. Song, J. Mach. Learn. Res. 3, 1229, (2003).
23. J.-P. Doucet, F. Barbault, H. Xia, A. Panaye, B. Fan, Curr. Comp. Aid. Drug Des. 3, 263, (2007).
24. B. Schölkopf, A. J. Smola, R. C. Williamson, P. L. Bartlett, Neural Comput. 12, 1207, (2000).
26. K. Tuppurainen, S. Lötjönen, R. Laatikainen, T. Vartiainen, U. Maran, M. Strandberg, T. Tamm, Mutat. Res. 247, 97, (1991).
27. D. F. V. Lewis, C. Ioannides, D. V. Parke, Xenobiotica 24, 401, (1994).
28. Z. Zhou, R. G. Parr, J. Am. Chem. Soc. 112, 5720, (1990).
(Received: May 5, 2010 - Accepted: May 20, 2011).