Feature reduction using a RBF network for classification of learning styles in first year engineering students

A large number of input variables in an Artificial Neural Network (ANN) causes problems in the design, structure and performance of the network. Feature reduction is the technique of selecting a subset of ‘relevant’ features in order to build robust learning models such as an ANN. In this paper, the well-known Principal Component Analysis (PCA) approach is applied to tackle this problem in the design of an ANN with Radial Basis Functions (RBF) that classifies users according to predefined learning styles. The model is built on a data set of answers provided by 183 users of a computer interface to a series of 80 questions (which correspond to characteristics related to the users' learning styles), each user being associated with one of four (4) possible classifications/styles. This data set, without preprocessing, is first used to train an RBF network. Then PCA is used to preprocess the data set, reducing the number of dimensions (the 80 measured characteristics) that are the input to the ANN. The main objective is to assess the relevance that an ANN could have as a classifier element in User Adaptive Systems (UAS).


INTRODUCTION
A user model is a representation of the knowledge and preferences that a system "believes" its user has. It represents features and decisions of the user that are accessible to the software system. A system using this type of model can adapt its behavior to the needs of the user and can dynamically construct a representation of the user's interests and characteristics. For our purposes, a series of user learning styles has been compiled, following the scheme of Alonso [1] and using a specific tool [27]. The idea is that the data initially gathered by this tool can be used as feedback for the inference process of a classifier system that can follow the user's behavior in the system. In this way, the system adapts to the behavior and preferences that the user develops during a teaching-learning process.
One of the crucial aspects of the adaptation scheme described above is the proper use of the classifier. Having many variables (or attributes) at the input of an Artificial Neural Network (ANN) generates problems in the design, structure and performance of the network itself, which is the case for the data gathered from the users; the predictive precision of a classifier can degrade, especially when it faces irrelevant attributes. This phenomenon is explained by the Curse of Dimensionality [5], which refers to the exponential growth, with the dimensionality (number of attributes), of the number of instances needed to describe the data. Attribute selection attempts to obtain a subset of the original attributes of a data set such that a learning algorithm executed over this subset achieves the highest precision possible. When obtaining patterns to describe people (for example, their learning style) it is normal to handle a great number of related variables; pre-processing these variables can lead us to extract a subset that contains the variables or attributes that are relevant to the classifier.
This paper is organized as follows: the next section briefly introduces ANNs of the RBF type and PCA, continues with work related to dimensionality reduction using ANNs and PCA, and then shows the use of ANNs in Adaptive Systems; the following section develops the proposed approach; then some of the results obtained are presented; and finally the conclusions and future work are stated.

RELATED WORK
In this section both the RBF network approach and the PCA technique are described. Both have been used extensively, so this section presents only some examples of each technique at the application level, and of the interesting synergy between them. Finally, the applicability of ANNs as an adaptation technique (classifying users) inside an Adaptive System, which is the field of study, is also shown.

RBF Networks and Principal Component Analysis
RBF networks are well suited to solving pattern classification problems because of their simplicity, their topological structure and their ability to outline the learning process [19].
The performance of an RBF network depends on the number and positions of the radial basis functions, their shape and the method used for learning. The strategies for learning in RBF networks can be classified into three groups:
-First, selecting the centers randomly from the training data [7].
-Second, using methods that rely on unsupervised procedures to select the centers ([19,14]).
-Third, using methods that rely on supervised procedures to choose the centers ([11,15]).
An RBF network consists of an i-dimensional input that passes directly into a hidden layer (see Figure 1). Suppose there are j neurons in the hidden layer. Each of the j hidden neurons applies an activation function, which is a function of the Euclidean distance between the input and a prototype vector of the same dimension as the input (i-dimensional). Each hidden neuron has its own prototype vector as a parameter. The output of each hidden neuron is weighted and sent to the output layer. The output of the network is the weighted sum of the values of the hidden neurons.
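As a minimal sketch of the forward pass just described, the following Python fragment computes the output of an RBF network with a single output neuron. The Gaussian form of the activation, the per-neuron width and the bias term are assumptions made for illustration; the description above only requires an activation that depends on the Euclidean distance to the prototype. All names and toy values are hypothetical.

```python
import math


def gaussian_rbf(x, prototype, width):
    """Activation of one hidden neuron: exp(-||x - c||^2 / (2 * width^2)).

    The Gaussian shape is an assumption; other radial functions are possible.
    """
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, prototype))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def rbf_forward(x, prototypes, widths, weights, bias=0.0):
    """Weighted sum of the hidden-neuron activations for input x."""
    hidden = [gaussian_rbf(x, c, s) for c, s in zip(prototypes, widths)]
    return bias + sum(w * h for w, h in zip(weights, hidden))


# Toy network: 2-dimensional input, 3 hidden neurons, 1 output (hypothetical values).
prototypes = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
widths = [1.0, 1.0, 1.0]
weights = [1.0, 2.0, 0.5]

output_at_first_prototype = rbf_forward([0.0, 0.0], prototypes, widths, weights)
```

Each prototype has the same length as the input vector, so an 80-3-1 network would hold three prototypes of 80 values each.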
Principal Component Analysis (PCA) is a mathematical procedure that transforms a number of (possibly) correlated variables into a (small) number of uncorrelated variables called Principal Components.
The new variables are linear combinations of the original ones and are constructed in order of importance, in terms of the share of the total variability of the sample that they capture. The procedure was invented in 1901 by Karl Pearson [21]; nevertheless, the complexity of the calculations delayed its development until computers appeared and came into use in the second half of the 20th century. Because PCA has blossomed relatively recently, it is still scarcely used by many researchers who are not specialized in statistics.
Ideally, the procedure seeks m < p variables that are linear combinations of the original p variables, are uncorrelated, and gather most of the information or variability in the data. If the original variables are already uncorrelated, then it is not appropriate to apply PCA. The procedure pursues two main goals: to reduce the dimensionality of the data set and to identify new meaningful variables.
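The idea can be illustrated with a small sketch that extracts the first principal component of a toy data set. Power iteration on the sample covariance matrix is used here only because it needs no external library; it is an assumption of this example, not the procedure used in this work, and the data values are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centered rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    return cov, centered


def first_principal_component(cov, iterations=200):
    """Power iteration: largest eigenvalue of cov and its unit eigenvector.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p))
                     for a in range(p))
    return eigenvalue, v


# Two strongly correlated toy variables (hypothetical values).
rows = [[2.0, 1.9], [0.0, 0.1], [1.0, 1.2], [3.0, 2.8], [4.0, 4.0]]
cov, centered = covariance_matrix(rows)
eigenvalue, direction = first_principal_component(cov)

# Scores: projection of each centered sample onto the first component.
scores = [sum(c[j] * direction[j] for j in range(len(direction)))
          for c in centered]
# Share of the total variance captured by the first component.
explained = eigenvalue / sum(cov[j][j] for j in range(len(cov)))
```

Because the two toy variables are strongly correlated, a single component retains almost all of the variance, which is the situation in which replacing p variables by m < p components is worthwhile.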

Principal Component Analysis and Neural Networks in the Dimensionality Reduction Problem
Thomas Portele [22] investigated the possibility of automatically classifying speakers according to their linguistic style, and carried out this analysis on three different domains. For every speaker, various parameters were calculated and reduced to linguistically interpretable components using PCA; the classes were then established by means of a cluster analysis. The untreated inputs were classified using an ANN, with different error rates for every domain.
The work of McMahon in 2005 [18] shows an approach for the classification of regional segments in the myocardium. Strain measurements taken over the cardiac cycle from pig electrocardiograms are analyzed. The classification is made by PCA and an ANN, which are combined in a Data Mining process. The differences in the strain waveforms between a normal myocardium and a diseased one can clarify the corresponding changes in the physiology. The altered function of the cardiac muscle is reflected in the strain, and the computer analysis helps in the diagnosis of ischemia.
In [10], it is noted that different types of algorithms have been proposed for selecting the prototypes and training an RBF network. The paper presents a learning algorithm based on gradient descent to train the RBF network and additionally proposes PCA to find the number of patterns in a classification problem.
Balasubramanian in 2007 [2] shows an application to diagnostic images. An automatic process for classifying these images into four classes (Normal, Cyst, Benign and Malignant) is presented, using texture characteristics extracted with several statistical and spectral methods. The optimal feature selection is done manually. PCA is used to extract the principal characteristics, or directions of maximum information, of the data set. From these optimal characteristics, a final set of combined features is used to classify the images into the classes mentioned previously. For this process, clustering and ANN methods are used.
In ([25], [26]) an approach to face recognition using PCA and an ANN of the RBF type is proposed. PCA has become a very popular representation for face images because it not only reduces the image dimensionality but also retains the variations in the image information. After PCA is applied, the hidden-layer neurons of the RBF network are adapted, taking into account discriminating characteristics between the classes of the training images. This helps the RBF network to acquire information about the variability of the input space and improves its generalization capacity.
Figure 1. Scheme of a Radial Basis Function network (input layer, hidden layer, output layer), adapted from [6].
In [13], a new multivariate non-linear technique is proposed to model and predict chaotic time series. This method analyzes the relationships between spaces and states; in this process, inverse predictability and time lapses are introduced to discover fundamental relationships. The time series are then predicted by a multivariate prediction.
Although multivariate time series can give a lot of information about complex systems, they also come with a large number of input variables, which results in overtraining and poor generalization capacity. To overcome these obstacles, PCA is used to extract the principal characteristics of the time series and to reduce the input to the model. A four-layer ANN is then used as the predictive model.
Finally, it should be noted that an ANN can itself be used as a tool to perform PCA [20]. Although this is not the objective of this work, it has been considered as a future approach (see Conclusions) that would broaden our criteria for comparing models on the problem treated.

Neural Networks in User Modeling
The identification of the adaptation tasks and their subsequent accomplishment are fundamental phases in the development of Adaptive User Systems (AUS), which is why many references to them can be found in the fields of user modeling, adaptive user interfaces and human-computer interaction ([12,17], [18,9]). Most of the reviews done so far are oriented to objectives and techniques; that is, the tasks and the systems that apply them are classified according to the goals they are meant to achieve. Nevertheless, the three aspects that influence the design of the adaptation tasks are rarely described. These are:
-First, adaptation types.
-Second, objectives and techniques.
In general, a User Model contains some adaptive and some adaptable elements. Ideally, the adaptable elements should be reduced to a minimum (age, gender, favorite color, etc.), while the other elements (favorite topics, behavior patterns, etc.) should be obtained in the learning process. These concepts have been presented as implicit acquisition models [23].
The user modeling problem can be approached through automatic learning, because a user exhibits a typical behavior when accessing an Adaptive User System, and the set of interactions that contains these patterns can be stored in a database or log. In this context, Automatic Learning and Data Mining techniques can be applied to recognize regularities in the user patterns and to integrate them as part of the user model. The output of a learning technique is a structural description of what has been learned, which can be used to explain the original data and to make predictions. From this perspective, Data Mining and other learning techniques make it possible to create user models in AUSs.
According to [8], an AUS can be divided into two stages (see Figure 2). The first process models the user, while the second takes the generated model and provides the basis of the adaptation. This work is related to the first type of process. M. Kayama and T. Okamoto [16] have worked with a model in which the user explores activities in cyberspace with a mechanism based on a sub-symbolic approach, which helps to decide navigation strategies. This model does not interfere with the student learning system; it only helps the student to navigate the Internet to acquire knowledge. The idea is to use Hypermedia Systems as a learning environment, in such a way that students can explore the network by themselves.

Beck, in [3] and [4], builds user models in an Intelligent Tutor System. The information is gathered by the Tutor System, which uses an ANN to give individualized recommendations according to the student's level of knowledge.
Wilson [28] conducts an experimental study that measures performance while using physiologically controlled assistance in real time. Six channels were used, among them EEG, ECG, EOG and a breathing channel, together with an ANN for task location, taking into account the work assigned to the user.
IMMEX research [24] describes a probabilistic approach to developing predictive models of how a student learns problem-solving skills in qualitative general chemistry. The intention is to use these models to apply interventions actively and in real time when it is detected that learning is not optimal. First, a self-organizing neural network approach is used to identify the most common strategies in the online tasks, and then Hidden Markov Models are applied to the sequences of these strategies as ways of learning.

PROPOSED APPROACH
This section presents the work performed. First, it explains how the information is obtained and what characteristics it has; PCA is then applied to this data, so that both an original set and a pre-processed set are available to load into the RBF network that is to be used as a classifier model.

Data Collection
The information is initially gathered through a web application that automates the Honey-Alonso Learning Styles Test [1]. This test consists of 80 statements, each of which has 2 options to be marked, more (+) or less (-), selected according to the degree to which the individual taking the test agrees with the statement. The test generates a numerical and graphical result for each style, rated on a scale of 5 preference levels: Very low, Low, Moderate, High and Very high. The test applied is based on the Spanish version translated by Catalina Alonso, with some modifications that make it clearer to answer; for example, a scale of relative weights for rating each statement has been established, based on Kolb's learning cycle. The test comprises 80 questions (which in our scheme are associated with user attributes). The test was applied to a total of 183 users, all of them enrolled in a Basic Sciences course in the Engineering School of the Universidad del Sinú - Elías Bechara Zainum during two consecutive semesters. The data set therefore has a large number of attributes relative to the number of samples, which is a factor in this type of analysis.
Once the student logs in to the test index, that is, the questionnaire, he or she is presented with the 80 questions to be answered, divided into groups of 20; each group of questions is related to one learning style and defines it. Each question has 4 possible scores, from which the user chooses according to his or her level of agreement with the question. These are (0, 25, 75, 100), where zero (0) means total disagreement and one hundred (100) total agreement. It should be highlighted that there is no middle value for any answer; this eliminates ambiguities and makes it possible to determine more clearly the style the user belongs to (Reflective, Theoretic, Active or Pragmatist). Figure 3 shows an interface of the test.
Once the information is collected, the system constructs the files that serve as input and output for our neural network. The output has been encoded as numerical values as follows: REFLECTIVE = 0, THEORETIC = 1, ACTIVE = 2, PRAGMATIST = 3.
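The encoding can be sketched as follows. The style codes and the four allowed scores come from the description above; the function name, the validation rules and the example answers are hypothetical and do not reproduce the actual file format used by the tool.

```python
STYLE_CODES = {"REFLECTIVE": 0, "THEORETIC": 1, "ACTIVE": 2, "PRAGMATIST": 3}
ALLOWED_SCORES = (0, 25, 75, 100)
NUM_QUESTIONS = 80


def encode_user(answers, style):
    """Turn one user's questionnaire into (input vector, numeric target)."""
    if len(answers) != NUM_QUESTIONS:
        raise ValueError("expected %d answers" % NUM_QUESTIONS)
    for answer in answers:
        if answer not in ALLOWED_SCORES:
            raise ValueError("invalid score: %r" % (answer,))
    key = style.upper()
    if key not in STYLE_CODES:
        raise ValueError("unknown learning style: %r" % (style,))
    return [float(a) for a in answers], STYLE_CODES[key]


# Hypothetical answers for one user, not taken from the real data set.
example_answers = [0, 25, 75, 100] * 20
features, target = encode_user(example_answers, "Active")
```

Rejecting any score outside the four allowed values mirrors the absence of a middle option in the questionnaire.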
It should be recalled that the partial correlation coefficient indicates the relationship between two variables once the effect of the remaining variables in the model has been eliminated. When two variables share a lot of information with each other, but not with the rest, their partial correlation is high, and this affects the analysis. The matrix obtained in the correlation analysis, which is only partially reproduced in Table 1 because of its length, presents the coefficients of sample adequacy for every variable. It can be observed that the partial correlation coefficients are low, so it can be affirmed that PCA is appropriate for the studied variables.
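To illustrate what eliminating the effect of another variable means, the sketch below computes a first-order partial correlation between two variables while controlling for a single third one. This is the standard textbook formula for three variables, used here as a simplified stand-in for the 80-variable analysis; the data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# x and y both follow z plus small independent disturbances (toy values).
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x = [zi + d for zi, d in zip(z, [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5])]
y = [zi + d for zi, d in zip(z, [0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5])]

r_xy = pearson(x, y)
r_xy_given_z = partial_correlation(x, y, z)
```

In this toy case x and y are highly correlated only because both depend on z, so their partial correlation is close to zero: the information they share is also shared with the rest of the variables.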

Application of the Principal Component Analysis (PCA)
The principal components are obtained by calculating the eigenvalues and eigenvectors of a symmetric matrix. The goal of these components, as already mentioned, is to gather the majority of the observed variance, which avoids obtaining redundant information. To achieve this, the components have to be uncorrelated with one another (as already noted) and expressible as linear combinations of the variables that have actually been observed. Maximizing the variance incorporated in each of these components implies that each one contains as much information as possible.
From the table of eigenvalues (Table 2), it is possible to decide how many components or factors to choose. There are rules for determining the most appropriate number to keep. One is the Kaiser criterion, which indicates that the principal components whose eigenvalues are greater than one should be preserved. The most widely used criterion, however, is to observe the percentage of the total variance explained by every component or factor; when the accumulated percentage is considered high enough (normally close to 80%), the number of factors is sufficient.
In our model, the eigenvalue begins to fall below one from the 23rd component onwards, although this component still has a high eigenvalue; in addition, the accumulated percentage of explained variance rises to 85.81%. This value is high enough to consider 23 a sufficient number of factors. Because of space limitations, only the data corresponding to the first 23 factors are shown in Table 2.

The table of correlations between the variables and the factors can also be partially observed (Table 3).
Using this table, it is possible to carry out a factorial analysis, for better legibility of the data (from a value of 0.5, the sample adequacy can be considered good for a factorial analysis). To do that, the previous table has been normalized by considering Table 4.
Finally, it is necessary to obtain the matrix of coefficients for the calculation of the factor scores, which contains the weight of every variable needed to calculate the factor scores. By means of these estimated coefficients, a linear equation can be constructed for each of the extracted components, based on the variables and the factor scores.
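The linear equations mentioned above can be sketched as a product between one user's variables and the coefficient matrix. The sketch assumes that the variables have already been standardized and that the matrix holds one row per variable and one column per component; the function name and all values are hypothetical.

```python
def factor_scores(standardized_row, coefficient_matrix):
    """Score of one sample on every component.

    coefficient_matrix[v][k] is the coefficient of variable v in the
    linear equation of component k (assumed layout).
    """
    n_components = len(coefficient_matrix[0])
    return [
        sum(value * coefficient_matrix[v][k]
            for v, value in enumerate(standardized_row))
        for k in range(n_components)
    ]


# Toy case: 3 variables reduced to 2 components (hypothetical coefficients).
toy_coefficients = [[0.5, 0.1],
                    [0.3, -0.2],
                    [0.2, 0.4]]
toy_row = [1.0, -1.0, 2.0]
toy_scores = factor_scores(toy_row, toy_coefficients)
```

In the setting of this work the matrix would have 80 rows and 23 columns, so each user's 80 answers would be mapped to a vector of 23 scores.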

Sending Data to the Model of Radial Base Functions Network
In this approach, PCA is used to reduce an initial set of 183 samples, with 80 characteristics each, to a new set that contains the same 183 samples with a smaller number of characteristics, 23. The RBF model is tested with each of these data sets, as shown in the following section.

RESULTS
Initially, the RBF was trained with the original data.
For this case, a network with an 80-3-1 architecture was used. This RBF network was iterated for 400 epochs (an epoch is one complete pass over the data set) and reached a minimal training error, reflected in the performance indicators (MSE, Mean Square Error) shown in Table 5; the convergence of the error is shown in Figure 4.
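For reference, the MSE indicator can be computed as in the following sketch. It assumes a single numeric output per sample, consistent with the one output neuron of the architecture and the 0 to 3 class codes; the target and output values are hypothetical and are not taken from Table 5.

```python
def mean_square_error(targets, outputs):
    """Average of the squared differences between targets and outputs."""
    if not targets or len(targets) != len(outputs):
        raise ValueError("targets and outputs must be non-empty and equal in length")
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


# Hypothetical class codes and network outputs for four samples.
toy_targets = [0.0, 1.0, 2.0, 3.0]
toy_outputs = [0.5, 1.0, 2.0, 2.5]
toy_mse = mean_square_error(toy_targets, toy_outputs)
```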
Once the training was finished, the RBF network was tested using a cross-validation approach; this error is also reflected in Table 5. The cross-validation scheme yields the Confusion Matrix shown in Table 6, where it can be seen that, with this data set, the Active and Pragmatist types are correctly recognized (though they are the least frequent in the data set), while some error appears for the Reflective and Theoretic types, the error being larger for the latter.
From this confusion matrix, the indicators in Table 7 were obtained for our classifier scheme. As can be seen in Figure 4, the validation error converges quickly from approximately epoch 10 and becomes stable shortly before epoch 200.
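As an illustration of how indicators can be derived from a confusion matrix, the sketch below computes overall accuracy and per-class recall and precision. Which indicators Table 7 actually reports is not assumed here; the choice of these three, the function names and the toy labels are all hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def indicators(matrix):
    """Overall accuracy plus per-class recall and precision."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[c][c] for c in range(n))
    recall, precision = [], []
    for c in range(n):
        row_total = sum(matrix[c])                        # samples truly in class c
        col_total = sum(matrix[r][c] for r in range(n))   # samples predicted as c
        recall.append(matrix[c][c] / row_total if row_total else 0.0)
        precision.append(matrix[c][c] / col_total if col_total else 0.0)
    return {"accuracy": correct / total, "recall": recall, "precision": precision}


# Codes: REFLECTIVE = 0, THEORETIC = 1, ACTIVE = 2, PRAGMATIST = 3 (toy labels).
toy_true = [0, 0, 0, 1, 1, 1, 2, 3]
toy_pred = [0, 0, 1, 1, 1, 0, 2, 3]
toy_matrix = confusion_matrix(toy_true, toy_pred, n_classes=4)
toy_indicators = indicators(toy_matrix)
```

In the toy labels, the two less frequent classes are recognized without error while the two frequent ones are confused with each other, the same qualitative pattern described for Table 6.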
Then, the RBF network was trained with the second data set, obtained through PCA. In line with the previous network scheme, the new architecture of this RBF network is 23-3-1. It was also iterated for 400 epochs and reached a minimal error, reflected in the performance indicators shown in Table 8; the convergence of the error is shown in Figure 5. As in the process above, once the training was over, the RBF network was tested, again using the cross-validation scheme (this error is in Table 8), and the Confusion Matrix shown in Table 9 was generated.
From this confusion matrix, the indicators in Table 10 were obtained for our classifier scheme. As can be seen in Figure 5, in this approach the error descends from epoch 118 and at other epochs, and is stable and minimal by the end of training. Here the error (training and validation) is about 33% smaller than in the scheme that uses the original data.

CONCLUSIONS AND FUTURE WORK
As has been verified, PCA makes it possible to discover and prioritize the attributes that a neural network scheme has to take into account in the classification process, applied in the architecture of a UAS, by reducing the redundant information that can exist among them. The identification of these components allows us to know the most important aspects for determining the individual learning style.
By means of principal component analysis, it has been verified that the proposed attributes can in fact be summarized in 23 factors (approximately a quarter of the 80 original ones), which eliminate the redundant information according to the characteristics [...] vs. epoch 200. Additionally, this new approach can learn all the data, in contrast with the first, because of the reduction in the complexity of the input, which is an advantage if it is considered as a real-time adaptation approach. Likewise, if another type of pre-processing is provided together with PCA, or if several approaches that combine RBF with PCA are applied, such as the one proposed here, this type of error could be reduced to an acceptable level.
All these results indicate that the RBF performance is worse with few inputs, which was also expected; but if the dimensionality problem has to be solved, some information must be lost. Ideally, one would choose controls with exactly the same observable variables for every treated unit, but in this case there is a very large set of characteristics, which comes from the type of application of the classification process.
A next step is to consider different comparative approaches based on this work: one in which PCA is used to support the architecture and design of the hidden-layer neurons of the RBF network, and another in which the RBF network itself is used to perform the PCA and, once the summary set is found, a neural scheme is used as the classifier.
On the other hand, for generalization purposes it is necessary to use additional characteristic vectors in the analysis and, for validation purposes, another feature reduction method such as Singular Value Decomposition (SVD).

Figure 3. Initial screen (in Spanish) of the tool that captures the information of the test, [27].

Figure 4. Error convergence in training and validation for the original approach.

Table 3. Correlations between variables and factors.

Table 4. The entry of interest is the one of maximum value. It should then be noted manually which input has given the maximum value in the table of correlations between the variables and the factors.

Table 5. Performance of the network for the original approach.

Table 6. Confusion matrix for the network results with the original approach (Validation).

Table 7. Confusion matrix analysis for the network results with original approach.

Table 8. Performance of the network with PCA approach.

Table 9. Confusion matrix for the network with PCA approach (Validation).

Table 10. Confusion matrix analysis for the network with PCA approach.