INTRODUCTION

The problem of classifying patients with the diabetic neuropathy condition (DN) is a research issue in the health area, particularly in the Faculty of Health Sciences of the *Universidad de Antofagasta.* One of the activities of the professionals in the Department of Kinesiology of this institution is to support adult and elderly patients in the rehabilitation of musculoskeletal pathologies, with diabetic individuals being the main study group. In 50% of cases of type 1 and 2 diabetes mellitus, people develop DN. This affects the peripheral nervous system and may cause loss of sensitivity, muscle pain and weakness, since the nervous system does not correctly encode the signals sent by the body ^{1}. In extreme cases, due to sensory loss and poor circulation, patients are unaware of injuries or wounds to their limbs, leading to amputation of the affected parts. Hence, a patient classification model is needed which, through the analysis of quantitative attributes, identifies the presence of the pathology.

From the methodological point of view, a variation of the Cross Industry Standard Process for Data Mining (CRISP-DM) guide was used in this work for the development of machine learning models. The evaluation of the results and of the development process was based not only on standard performance metrics but also on the judgment of experts in the domain (kinesiology area), for the correct interpretation and evaluation of the results obtained. Initially, the physical therapist approached the problem of determining whether a patient presents a neuropathy through the analysis of symptoms using standardized clinical evaluations and posture tests performed with a Wii Balance Board (WBB) device.

The following business (clinical) objectives were considered in this domain:

• Improve patient care and treatment focus.

• Provide new sources of research, encouraging the use of machine learning techniques.

• Classify patients using the data sampled from the WBB, to determine the DN condition.

• Determine whether the measurement of center of pressure (COP) is a better predictor than clinical evaluations.

On the other hand, the technical objectives related to data analysis are:

• Develop a predictive model that classifies, according to their attributes, whether the patient has diabetic neuropathy or not.

• Find the correlation between the attributes of the medical surveys and the features of the time series.

• Identify the relevance of each attribute in the classification task, establishing a hierarchy based on the class attribute.

The results obtained from this work will later allow us to expand domain knowledge, refine the investigative process and consider new relevant features to improve the classification model.

In what follows, this article firstly considers the analysis of the state of the art, specifically related works using computational methods for the classification of pathologies that may be applicable. Secondly, it describes the materials required (data, hardware and software resources) and the methods used to generate results for each posed objective. Third, the results obtained from the data preparation and modeling stages are presented. Finally, the discussion of the obtained results and the conclusions are presented.

RELATED WORKS

As related works, we consider those that use COP time series to differentiate between subjects of different ages or health conditions. In addition, studies based on electromyography signals are analyzed as an alternative, even though the nature of the two signals is different.

Time series analysis

In time series analysis, different transformations are applied to obtain descriptive variables. This is useful when a time series lacks a clear trend or cycle, considering that the COP is sensitive to various external factors such as diseases that alter posture, distractors, involuntary movements of the human body, etc. Nevertheless, several studies have found that it is possible to identify conditions of the subject. In Yamagata's research ^{2}, a series of clinical tests was applied to older adults with no history of pathologies, to study the stiffness they adopt as a method to compensate for body sway. This has made it possible to differentiate not only the age range of a subject or the eyes-open or eyes-closed condition, but also to find disturbances associated with processes of other systems, such as breathing, the influence of vision, the vestibular and somatosensory systems, proprioceptive information, among others.

In COP time series studies ^{2}, general statistical variables are obtained through calculations that do not represent a specific attribute of the subject, their health or a clinical assessment, but rather summarize the series as single values, which are then entered into a machine learning model. Examples of statistical variables include the slope of a series, the standard deviation, the mean, the variance, the maximum, the minimum, the median, etc. Because the physiology of the COP signal is complex, laboratory tests usually include a control group, that is, healthy people, since there is no standardized value indicating whether a COP is normal or abnormal. For this reason, when researchers need to study physiological signals, they include both groups of patients, to compare the different results.
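As a concrete illustration of this approach, the sketch below reduces a simulated COP excursion to the kind of scalar descriptors listed above. The random-walk input stands in for real sway data and is an assumption for illustration only:

```python
import numpy as np

# Hypothetical COP excursion (cm), 30 s at 100 Hz, for illustration only.
rng = np.random.default_rng(0)
cop_x = np.cumsum(rng.normal(0.0, 0.05, size=3000))  # random-walk-like sway

# Interpret the whole series as a handful of scalar descriptors,
# as the cited studies do, instead of modeling trend or seasonality.
features = {
    "mean": float(np.mean(cop_x)),
    "std": float(np.std(cop_x)),
    "variance": float(np.var(cop_x)),
    "max": float(np.max(cop_x)),
    "min": float(np.min(cop_x)),
    "median": float(np.median(cop_x)),
    # slope of a least-squares line fitted against the sample index
    "slope": float(np.polyfit(np.arange(cop_x.size), cop_x, 1)[0]),
}
```

Each recording thus collapses into one fixed-length row of numbers, regardless of how long or irregular the original series is.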

Feature extraction

The features extracted from a time series can be in the time domain or in the frequency domain. Subasi, in his research ^{3}, obtained the coefficients of each sub-band after applying the discrete wavelet transform, to differentiate between normal, myopathic or neurogenic electromyogram signals. On the other hand, in Chamnongthai's work ^{4}, the features are based on graph theory: the time series peaks are discretized and transformed into weighted graphs, from which attributes are obtained to differentiate between normal signals and those with myopathic or neuropathic pathologies. Gregory King ^{5} explores a mixture of attributes unique to the COP, based on parameters of frequency, speed, body sway, among others. These studies have in common that, after feature extraction, the attributes are entered into machine learning models: a Support Vector Machine (SVM), a Random Forest, a Linear Discriminant Analysis (LDA) classifier and a neural network. The results obtained show that time series feature extraction is applicable to the current kinesiological domain. However, it should be considered that electromyography signals differ from COP signals, since the former exhibit a more disruptive structure than postural signals when an evaluated muscle is activated. The COP signal does not present the same behavior, since it represents the excursion of the posture, influenced by internal and external factors. Nevertheless, it seems feasible to adapt the methodology used in the related works to the current domain, in which the time series is transformed into new values (based on feature extraction), with its respective label: diabetic or neuropathic.
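The sub-band idea behind the wavelet approach can be sketched with a one-level Haar decomposition, the simplest member of the discrete wavelet family. This is an illustration of the principle only, not the specific wavelet or decomposition depth used in the cited work:

```python
import numpy as np

def haar_dwt_level(signal):
    """One level of the discrete Haar wavelet transform.

    Returns (approximation, detail) coefficients; statistics of these
    sub-band coefficients can then be used as classifier features.
    """
    x = np.asarray(signal, dtype=float)
    if x.size % 2:                     # pad to even length
        x = np.append(x, x[-1])
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-frequency sub-band
    detail = (even - odd) / np.sqrt(2.0)   # high-frequency sub-band
    return approx, detail

# Toy signal: the detail band captures the abrupt change.
sig = np.array([1.0, 1.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0])
cA, cD = haar_dwt_level(sig)
```

Because the transform is orthonormal, the signal energy is preserved across the two sub-bands, which makes per-band energy a natural derived feature.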

BACKGROUND

This section describes the background to the COP, which is a variable that represents the excursion of a subject's posture, used in posturology analysis. The next subsection describes its study.

Study of the center of pressure

COP is the most widely used variable in balance control research, since it makes it feasible to detect postural changes due to diseases such as Parkinson's disease, multiple sclerosis, myopathies, etc. It is defined as the point of application of all vertical forces exerted by an individual on a support surface ^{6}. In Figure 1, it can be seen that the center of mass (COM) is located close to the pelvic area, which projects a vertical force onto the WBB. This accessory captures the data in two dimensions: the anteroposterior (AP) and medio-lateral (ML) movement, which represent the Y and X axes respectively.
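The way a balance board turns four corner load readings into a 2-D COP point can be sketched as follows. The board dimensions and sensor layout below are assumptions made for illustration, not specifications taken from this study:

```python
# Assumed sensor spacing (mm) between the WBB's corner load cells;
# illustrative values, not measured from the actual device.
BOARD_X_MM = 433.0
BOARD_Y_MM = 238.0

def wbb_cop(top_left, top_right, bottom_left, bottom_right):
    """Return (cop_x, cop_y) in mm relative to the board centre.

    X is the medio-lateral (ML) axis, Y the anteroposterior (AP) axis.
    The COP is the load-weighted average position of the four sensors.
    """
    total = top_left + top_right + bottom_left + bottom_right
    if total <= 0:
        raise ValueError("no load on the board")
    cop_x = (BOARD_X_MM / 2) * ((top_right + bottom_right)
                                - (top_left + bottom_left)) / total
    cop_y = (BOARD_Y_MM / 2) * ((top_left + top_right)
                                - (bottom_left + bottom_right)) / total
    return cop_x, cop_y

# Equal load on all four sensors places the COP at the board centre;
# shifting weight to the right moves cop_x positive.
```

Sampling this pair over time yields exactly the two-column (COP X, COP Y) series analyzed in the rest of the paper.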

Through an analysis of the COP excursion, it is possible to determine whether an individual generates compensation to maintain their posture. The tests applied to the patient last 30 seconds and are carried out in two states, with eyes open and eyes closed, since without vision the brain tries to compensate for the sway using the proprioceptive or the somatosensory system ^{7}. Additionally, distractors can be included, such as dual tasks and an unstable surface. This causes the brain to send out more signals to compensate the posture, which makes the COP excursion more natural.

Machine learning algorithms

Machine learning is one of the branches of artificial intelligence, which seeks to build systems that work automatically in complex and changing environments. It is a combination of computer science and statistics. Advances in this subject have made it possible to create an approach based on learning behaviors that are mathematically modeled and that originate in stored data from the past. In the medical domain, the aim is to improve diagnostic processes or the tracking of disease evolution, so it is of interest to have a data source and a model that assigns its records to a certain class. The algorithms described below are based on the book on artificial intelligence by Stuart Russell and Peter Norvig ^{8} and the data mining book by Han, Kamber and Pei ^{9}:

1. Naïve Bayes: this classifier is based on Bayes' theorem, in which, given a set of classes *C*_{i}, the model predicts the class to which a value *X* belongs by evaluating the highest conditional probability *P* over each of the *n* elements *x*_{k}, as described in equation (1).

2. K-Nearest Neighbor: this model searches the pattern space for the *k* training values closest to an unknown value. The degree of "closeness" between two tuples *X*_{1} = (*x*_{11}, ..., *x*_{1n}) and *X*_{2} = (*x*_{21}, ..., *x*_{2n}) is based on the Euclidean distance. The mathematical model is described in equation (2).

3. Artificial Neural Network: it consists of a set of interconnected nodes, in which each attribute *I*_{j} enters separately and in parallel and crosses a hidden layer where a sigmoid activation function allows non-linear regressions to be fitted by means of the weighted sum of weights *w*_{ij}, where *i* is the value in the previous layer and *j* the current one, plus a bias *θ*; these parameters are adjusted iteratively. The mathematical model is described in equation (3).

4. Decision Tree: it is based on a graph structure, where the root node refers to the main attribute, the branches represent the result of the evaluation between attributes and each terminal node (leaf) represents the class. Attribute selection is carried out with the information gain metric, such that *p*_{i} corresponds to the probability that a tuple belongs to class *C*_{i} and, using a base-2 logarithm, the number of bits necessary to encode the information is obtained. The mathematical model is found in equation (4).
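The distance-based classification described in point 2 is compact enough to sketch directly; the toy data and labels below are hypothetical, standing in for the patient feature vectors:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k Euclidean-nearest
    training tuples, i.e. the distance of equation (2)."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example: two well-separated clusters standing in for DN / not-DN.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [3.0, 3.0], [3.1, 2.9], [2.9, 3.1]])
y = np.array(["not DN", "not DN", "not DN", "DN", "DN", "DN"])
```

Since the vote depends only on raw distances, feature scaling matters in practice: an attribute with a large numeric range would otherwise dominate the Euclidean distance.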

METHODOLOGY

This section mainly describes an overview of the CRISP-DM methodology, used for the development of the project.

CRISP-DM

The CRISP-DM methodology was developed at the end of 1999, when five European companies led the project until its first version was released, with the aim of providing a non-proprietary, freely available standard process model ^{10}. However, for the medical area there are certain considerations that the guide does not specify. In Krzysztof and Moore's research, they warn that care must be taken with "privacy sensitivity mining" (social, legal and ethical issues) and with the heterogeneity of the data ^{11}. Because some clinical data come from individuals, their discreet use is necessary to prevent ethical conflicts with patients. To address the warning of the mentioned authors, a model adjusted to the medical area is applied, known as CRISP-DM for the Medical Domain (CRISP-MED-DM), proposed by Niaksu ^{12}. This model adds activities and renames others according to the clinical environment, without changing the main structure; therefore, the original phases of the methodology are maintained.

The phases of the CRISP-DM methodology considered in this work are:

• Business understanding.

• Data understanding.

• Data preparation.

• Modeling.

• Evaluation.

The development of the research followed the model proposed by the KDD process and, more specifically, the CRISP-DM guide. Thus, in the business understanding stage, the business objectives, success criteria and technical requirements and objectives of the project were established. In the data understanding and data preparation stages, the characteristics of the data, the selection of attributes and the transformations were determined to obtain the datasets in the form required to apply data mining techniques. This stage included the use of the RapidMiner 9.4 tool (Technical University of Dortmund, Dortmund, Germany), Python 3 (Python Software Foundation) and spreadsheets in Microsoft Excel 2016. In the modeling stage, predictive models were built applying machine learning algorithms: Naïve Bayes (NB), K-Nearest Neighbor (KNN), neural networks (NN) and decision trees (DT), whose performance was assessed with standard metrics such as accuracy, precision, sensitivity (recall) and the area under the receiver operating characteristic (ROC) curve (AUC).

General process diagram

In Figure 2, the entire development process is modeled, based on the phases of the CRISP-DM methodology; the outputs of each phase become the inputs of the following one. The outputs of the Business Understanding phase are the business and technical objectives; the Data Understanding phase comprises an exploration of the original datasets (Excel file and time series). The Data Preparation phase produces clean datasets and features extracted from the time series; the Modeling phase then includes the design of classification models, correlation analysis and attribute hierarchy analysis. Finally, the last phase includes the tabulated results.

EXPERIMENTS

In this section, the data sets used are detailed, as well as a brief description of the hardware and software used.

Patient datasets

There are two main data sets, one corresponds to clinical evaluations and surveys, tabulated in an Excel file, while the second corresponds to comma separated values files (CSV) for each posture evaluation performed on all patients. The description of attributes is listed in Table 1.

The CSV files contain eight numerical attributes, without missing data. For analysis purposes, the most relevant attributes selected are COP X and COP Y. A representation of the COP for a diabetic subject is shown in Figure 3, and another for a subject with diagnosed DN in Figure 4; in neither case can a trend or seasonality be clearly identified.

Time series have been grouped by type of examination, as detailed in Table 2 according to the indications of the domain expert, which gives a total of 349 different files or time series. The reason for adding different distractors (closed eyes, unstable surface and dual task), is to obtain several samples of the COP, altering the senses of the subject, generating a more autonomous postural balance.

Hardware

This includes all physical components required to acquire and analyze the data:

• Wii Balance Board (WBB): device that obtains the coordinates of the patient's posture, through the calculations of the four internal sensors. The data is stored as a time series. Figure 5 represents the used device.

• Laboratory computer: equipment used by the physical therapist where the clinical data in Excel spreadsheet is generated and time series files are manipulated.

• Analysis computer: equipment used by the data mining analyst, with the necessary applications to carry out his activities.

Software

The applications installed on the hardware equipment are:

• BrainBlox: application used by the physical therapist that allows interaction with the WBB and stores the data obtained from it.

• RapidMiner 9.4: application used to design machine learning models and data preparation.

• Microsoft Excel 2016: Microsoft application that allows interaction with data sheets.

• Google Colab with Python 3.7: Google tool for programming ad-hoc applications in Python.

Feature extraction and selection

The activities developed allowed the understanding of the data recorded by the physical therapist in both data sources: clinical evaluations in an Excel spreadsheet and a set of time series files. No modifications to the Excel spreadsheet were required, apart from the exclusion of irrelevant attributes.

To the second dataset, transformations were applied with the aim of extracting features that represent the time series adequately. Using the RapidMiner tool, the following activities were carried out:

• Define class attribute (label).

• Parse attributes Age, Glucose and Time MD to numerical.

• Conversion of the attribute Sex into nominal category type and the tests EFAM, NSS, DNS, SVAD, SVAI, SVBD and SVBI into ordinal category.

• Exclusion of the attributes Name and Subject. However, the subject identifier is still used to link the Excel spreadsheet and the time series.

Time series files were grouped according to the type of examination to which they correspond. Additionally, two files are added, General EO and General EC, which consolidate all time series with eyes open and eyes closed respectively. This results in a total of ten files, to which a moving average is applied to smooth the signal, with a window size of 100 samples ^{13}. Then, feature extraction is applied to obtain the maximum, minimum and median values of the COP on the X and Y axes. The process also includes crossing the subject identifier in the Excel spreadsheet with the time series of each patient, to label each group of files. Finally, the final dataset is created with the following time series features:

• Maximum value in COP X.

• Minimum value in COP X.

• Median value in COP X.

• Maximum value in COP Y.

• Minimum value in COP Y.

• Median value in COP Y.

• Class from the Excel spreadsheet.
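The smoothing and feature-extraction step above (moving average with a 100-sample window, then maximum, minimum and median per axis) can be sketched as follows. The function name, the sampling assumptions and the random input are illustrative, since the actual processing was done in RapidMiner:

```python
import numpy as np

WINDOW = 100  # smoothing window used in the text (100 samples)

def extract_features(cop_x, cop_y, label):
    """Smooth both COP axes with a moving average, then reduce each to
    its maximum, minimum and median, matching the final dataset layout."""
    def smooth(series):
        kernel = np.ones(WINDOW) / WINDOW
        # 'valid' keeps only positions where the window fully overlaps
        return np.convolve(np.asarray(series, dtype=float), kernel,
                           mode="valid")
    row = {}
    for axis, series in (("X", cop_x), ("Y", cop_y)):
        s = smooth(series)
        row[f"max_{axis}"] = float(s.max())
        row[f"min_{axis}"] = float(s.min())
        row[f"median_{axis}"] = float(np.median(s))
    row["class"] = label  # taken from the Excel spreadsheet
    return row

# Hypothetical 30 s recording at 100 Hz (3000 samples per axis).
rng = np.random.default_rng(1)
sample = extract_features(rng.normal(size=3000), rng.normal(size=3000), "DN")
```

Running this once per time series file yields one labeled row per recording, which is the table the classifiers are then trained on.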

Construction of data models

In this phase, algorithms were designed according to the technical objectives initially set. Five different models were built and applied to each data source, according to its own limitations. Table 3 describes the strategy used for each dataset and technical objective. The algorithms used are NB, NN, KNN, a covariance matrix and a DT. The cross-validation method is used, in which the data is divided into *k* units (10 folds) and each of the *k* units is crossed with the training data. In each iteration, one fold is held out for testing, so 10% of the data constitutes the stratified test set.

The default parameters of each model were used, according to the default RapidMiner settings. Minor adjustments were made to the KNN and DT models, tuning only the *k* value and the number of trees respectively. The parameters of the models are detailed as follows: Table 4 for the NN model, Table 5 for the KNN model and Table 6 for the DT model.
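The stratified 10-fold split described above can be sketched as follows. This is a hand-rolled illustration of the stratification idea, not RapidMiner's implementation:

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Distribute record indices into k folds while preserving the class
    proportions, so each fold can serve once as the ~10% test partition
    while the remaining k-1 folds are used for training."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # deal each class's indices round-robin across the folds
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds
```

Stratification matters here because the patient groups are small: a plain random split could easily leave a fold with almost no DN cases, making its metrics meaningless.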

RESULTS

The metrics of evaluation are used to measure the performance of the models and verify the degree of compliance with the technical objectives.

• Accuracy: rate of correctly classified tuples. Calculation is shown in equation (5).

• Error: rate of incorrectly classified tuples. Calculation is shown in equation (6).

• Precision: the number of positive hits divided by the total number of instances classified as positive, whether correctly or not. Calculation is shown in equation (7).

• Sensitivity: completeness measure that corresponds to the proportion of positive instances correctly classified. Calculation is shown in equation (8).

• AUC: degree of separability between classes. The ROC curve plots two parameters, the true positive rate (TPR) and the false positive rate (FPR), whose calculations are shown in equations (9) and (10).
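Assuming the usual confusion-matrix definitions behind equations (5) to (10), the metrics can be computed as in the following sketch; the counts are made up for illustration:

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts: true/false
    positives (tp, fp) and true/false negatives (tn, fn)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,          # equation (5)
        "error": 1.0 - accuracy,       # equation (6)
        "precision": tp / (tp + fp),   # equation (7)
        "sensitivity": tp / (tp + fn), # recall / TPR, equation (8)
        "fpr": fp / (fp + tn),         # x-axis of the ROC curve
    }

# Hypothetical counts from a 20-patient test fold.
m = classification_metrics(tp=8, fp=2, tn=7, fn=3)
```

Sweeping a classifier's decision threshold produces a sequence of (FPR, TPR) pairs; the AUC is the area under the curve they trace.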

Classification model

This section lists the results obtained with the machine learning models, for each posed objective. From Table 7, it is inferred that NB achieves an average accuracy of 70% when the probabilistic model is applied to the 12 attributes of the Excel spreadsheet, considering the low number of instances and the high dimensionality. This implies that it is feasible to design a model based only on clinical attributes, as an alternative to feature extraction from the time series. However, the AUC value of 0.6 indicates low differentiation between classes, due to the low number of instances (21 patients) in the dataset. Another point to consider is that the model is still influenced by the human bias present in each exam. Therefore, the model should be limited to comparative purposes between datasets and not be taken as a definitive solution.

On the other hand, Table 8 and Table 9 show the metrics obtained on the datasets of time series features, grouped by type of exam. In the KNN model results, there is a balanced performance between eyes-closed and eyes-open exams. The set that achieves the best performance, with 77.5% accuracy and 0.83 AUC, is the EO set. Accuracies over 70% are obtained in the eyes-open sets, except for the EOUSDT dataset. This is explained by the fact that the signal becomes more complex as more distractors are incorporated. In the eyes-closed datasets, the model performs worse on average than with eyes open. Although a high accuracy of 86.5% is achieved in ECDT, extracting new features is considered an alternative to improve performance.

In Table 9, similar results are obtained with the NN model. The EO dataset obtained 74.5% accuracy, with values over 80% for precision and specificity. The ECDT dataset also obtained 74.5% accuracy, higher than the rest of the eyes-closed exams, which reaffirms that the models are influenced by the type of features entered.

Attributes correlation

Regarding the second technical objective, the correlation matrix is used to verify the influence between the attributes. Table 10 shows ten results obtained from the model: the first five with the highest positive correlation and the rest with negative correlation. The results indicate that a clear correlation is not found between the set of attributes. This gives an indication that none of the evaluations supersedes another.

Figure 6 shows the correlation between attributes, excluding non-binary categorical ones, such as the NSS and the vibration tests. The results indicate that the direct proportionality between attributes is low, with the highest value being 0.33, between age and time of medical diagnosis of diabetes.

In Table 11, five results with the highest correlation between attributes are listed. The eyes open tests, including unstable surface and dual task datasets are grouped in EO and the eyes closed tests in EC respectively, then the calculation is made for each one. The results indicate that the high correlation (above 0.8) implies that there is redundancy when considering the maximum, minimum and median at the same time.

In Figure 7 and Figure 8, high correlation is obtained only among COP X features and, likewise, among COP Y features, for both datasets. No correlation exists between the X and Y features.
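The kind of redundancy reported here can be reproduced on synthetic columns: three features driven by the same underlying sway magnitude correlate strongly with each other and weakly with an independent one. The simulated values are illustrative only, not drawn from the study's data:

```python
import numpy as np

rng = np.random.default_rng(2)
base = rng.normal(size=200)            # per-subject sway magnitude
# max, min and median of the same smoothed axis move together...
max_x = base + rng.normal(0, 0.1, 200)
min_x = base + rng.normal(0, 0.1, 200)
median_x = base + rng.normal(0, 0.1, 200)
# ...while a feature from the other axis varies independently.
unrelated_y = rng.normal(size=200)

# Rows are variables, columns are observations (subjects).
corr = np.corrcoef([max_x, min_x, median_x, unrelated_y])
```

When several columns exceed a correlation of about 0.8, keeping all of them adds little information; dropping or combining the redundant ones is the usual remedy.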

Attribute hierarchy

For the third technical objective, the DT algorithm is applied to identify which of the attributes yields the highest information gain for classification. The graph of the generated DT is shown in Figure 9.

Figure 9 shows that the NSS attribute is the best at identifying the pathology (0 for diabetic and 1 for neuropathic). This attribute is divided into: 0 for normal, 1 for mild, 2 for moderate and 3 for severe symptoms. If the NSS score is mild, the DT then evaluates the Age attribute: if Age is greater than 70.5 years, the patient is classified as DN, otherwise as not DN. Nevertheless, there is a margin of error when classifying patients with and without DN using age > 70.5 as the discriminator, which is reflected in the AUC value.
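The rule read off Figure 9 can be restated as a plain function. The branches for normal, moderate and severe NSS scores are one plausible reading and an assumption beyond the mild branch explicitly described in the text:

```python
def classify_dn(nss, age):
    """Return True when the tree predicts diabetic neuropathy.

    NSS scores: 0 normal, 1 mild, 2 moderate, 3 severe.
    The >= 2 and == 0 branches are assumed, not stated in the text.
    """
    if nss >= 2:          # moderate or severe symptoms (assumed branch)
        return True
    if nss == 1:          # mild symptoms: fall back to the Age split
        return age > 70.5
    return False          # normal NSS score (assumed branch)
```

Written this way, the tree's appeal is obvious: the model the clinician has to trust is two nested conditions, not an opaque weight matrix.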

Compared with the KNN and NN classifiers, the DT model has the highest performance, reaching an accuracy of 83.3% with the lowest error rate, as shown in Table 12. However, the AUC value is 0.5, so the class attribute is not correctly discriminated. The high dimensionality and low number of records would affect the model.

DISCUSSION OF RESULTS

Results were presented to the academic physical therapist, during the phases of the CRISP-DM methodology and its development. The emphasis was on the analysis and transformation of the datasets, in addition to the modeling of solutions to meet the proposed objectives.

The results on the Excel spreadsheet dataset show that the accuracy obtained was 67% (33% error) using the NB classifier, which fits properly despite the high dimensionality and low number of records. With respect to this dataset, it is agreed that the NSS exam, which obtained the highest score when applying the DT, is the most appropriate to perform during the clinical evaluation of patients and to label them as DN or not DN. On the other hand, the tests performed with the tuning fork (SVAI, SVAD, SVBI and SVBD) are those with the greatest bias, because what is evaluated is the patient's perception of a stimulus, which makes it a very subjective measure compared to survey-type tests like EFAM, DNS or NSS. This assumption is verified in the DT model, since these tests do not appear in the final tree.

In the time series analysis with the KNN strategy, the worst-performing model obtained accuracy and error rates of 51% and 49% respectively, while the best reached 86.5% accuracy with a 13.5% error rate. In the case of the NN model, the worst performance was 60% accuracy with 40% error, and the best 75.5% accuracy with 24.5% error. Nevertheless, the results are influenced by the type of exam and the features extracted. This assumption is verified when separate models are applied to each dataset (time series files). The rest of the datasets obtained on average 69% accuracy with KNN and 59% with NN. In general, the models allow the pathology to be classified but do not achieve higher performance, thus different features from the time series must be considered.

CONCLUSIONS AND FUTURE WORK

New combinations of attributes were not feasible to apply due to differences in the quantities of each dataset; therefore, it must be evaluated whether including or excluding attributes improves the accuracy rates. Nevertheless, it is concluded that the applied techniques achieve an ordered model that classifies DN in both datasets. On the other hand, the number of records must be reviewed in further analysis, since it is estimated that increasing it improves performance. Dividing the datasets by type, as indicated by the expert, is suitable since the exams and the sources where data are captured are different.

With respect to the use of the methodology in the health area, the activities proved applicable. As the case study took place in a reduced environment, activities involving other clinical areas were not necessary and did not affect the development of the research. However, the guide does not explicitly specify how to measure compliance of the model, since it is understood more as a guideline of good practices. For this reason, performance evaluation is carried out by verifying whether the activity was completed and whether there are any observations. The strength of the guide is to encourage the use of tools for building models and datasets.

As future work, it is planned to increase the number of instances, in both types of datasets and to apply the same models developed, considering the reduction of attributes. The emphasis of the research will be on the transformation of time series. An alternative is to include the study of signal entropy. The focus will be on the study of the state of the art in signal transformations. The aim is to evaluate the feasibility of replicating the models in other similar problems.