Empirical Evaluation of Three Machine Learning Methods for Automatic Classification of Neoplastic Diagnoses

Abstract
Diagnoses are a valuable source of information for evaluating a health system. However, they are not used extensively by information systems because diagnoses are normally written in natural language. This work empirically evaluates three machine learning methods to automatically assign codes from the International Classification of Diseases (10th Revision) to 3,335 distinct diagnoses of neoplasms obtained from UMLS®. This evaluation is conducted on three different types of preprocessing. The results are encouraging: a well-known rule induction method and maximum entropy models achieve 90% accuracy in a balanced cross-validation experiment.


Introduction
Technology Assessment in Health Care (TAHC) considerably improves decision making in patient care, allowing greater efficiency in the use of resources and improving people's quality of life [1]. Evaluating medical technologies gives decision-making authorities grounds for judging the convenience of using, diffusing or accepting certain technologies. It also provides information to physicians and patients on the proper use of some technologies in specific health problems, and it guides hospitals towards the most adequate solutions in terms of cost and effectiveness. TAHC is especially important in developing countries, as they are normally consumers of technology and their health resources are more limited.
One of the main difficulties of TAHC is that it requires Risk Adjustment, which is the general term for "accounting for patient-related factors before comparing outcomes of care" [2]. In this analysis, "risk" does not correspond only to risk of death, but to a wider concept that falls into three broad areas: clinical outcomes of care (e.g. death, normal vision, etc.), resources used (e.g. length of stay) and patient-centred outcomes (e.g. satisfaction on care preferences).
The use of Risk Adjustment for measuring both efficiency and efficacy has recently acquired great relevance, and it is even beginning to be considered in the calculation of insurance payments, the assignment of public resources, and the evaluation of health personnel [2,3]. However, it has been found that current models for the estimation of Risk Adjustment have problems because they do not include complete and reliable diagnostic information. For example, studies in Chile have determined that 51% of the variation of a patient's stay in an intensive care unit can be attributed to the diagnosis and its morbidities [4], and that the prediction of mortality improves by 75% when these variables are added to those considered by the APACHE method, which is the most widely used physiological index of seriousness internationally [5]. This has led physicians and health service administrators throughout the world to promote improvements in the processes of capturing the diagnostic information of their patients [6].
In medicine, language is a valuable representation and communication tool that can be used at all levels of the health system, affecting each of those levels according to its meaning. One of the main measures for evaluating the operation of the system is the diagnostic hypothesis or admission diagnosis, which includes the measure of the seriousness and complication of the patients' condition [4].
Having coded diagnoses is necessary not only to evaluate the seriousness of the patients' condition and get descriptive statistics, but it is also necessary for evaluating the effectiveness of the medical intervention and for generating predictive models of the operation of the health system [7].
Nowadays there is a large variety of controlled languages [8−10] that allow the standardisation of the process, but their large sizes and their variability make simple lexicographic searches infeasible. For this reason, until now virtually all medical coding is done manually by people trained in both the medical field and the classification system in use, and most of the computer systems in this application exist only to support human coders [11].
As a step toward automatic coding of medical diagnoses, the performance of three different machine learning methods on classifying neoplastic diagnoses according to the International Classification of Diseases 10th Revision (ICD-10) [8] is studied. The choice of neoplastic diagnoses has two main advantages: it allows wide-range coverage of medical terminology because neoplastic alterations can occur anywhere in the body, and diagnoses coming from the field of pathology are considered to be definitive in medicine, providing the necessary templates to evaluate the system.

Data Source
The diagnosis source used in this work is the database provided by version 2004AB of the Unified Medical Language System® (UMLS®) [10]. The process begins by providing UMLS® with the code of each neoplastic diagnosis contained in ICD-10. UMLS® delivers a Concept Unique Identifier (CUI) for each of them, and these CUIs are then used to retrieve from the database all those diagnoses in Spanish that come from sources other than ICD-10. Figure 1 shows an example of this process for the ICD-10 code C22.9 MALIGNANT NEOPLASM OF LIVER, UNSPECIFIED.
After a data cleaning process that includes standardising the text to upper case, replacing accented vowels, and deleting punctuation signs and parentheses, 3,335 different diagnoses in natural language are obtained (e.g. CANCER DE HIGADO, TUMOR MALIGNO DE HIGADO). This corpus does not contain the original ICD-10 diagnoses. In the example of Figure 1, the diagnoses come from the Spanish versions of the Medical Dictionary for Regulatory Activities Terminology (MedDRA), the Medical Subject Headings (MeSH) and the World Health Organisation Adverse Drug Reaction Terminology (WHOART).
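The cleaning steps listed above can be sketched as follows. This is a minimal illustration, not the procedure actually used to build the corpus: the function name `clean_diagnosis` is hypothetical, and stripping every combining mark is an assumption that goes slightly beyond "replacing accented vowels" (for instance, it would also turn Ñ into N).

```python
import string
import unicodedata


def clean_diagnosis(text: str) -> str:
    """Upper-case the text, strip accents, and drop punctuation and parentheses."""
    # Decompose accented characters, then drop the combining marks (Á -> A).
    decomposed = unicodedata.normalize("NFD", text.upper())
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace ASCII punctuation (parentheses included) with spaces so that
    # words joined only by punctuation remain separate tokens.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    # Collapse the runs of whitespace left behind by the replacement.
    return " ".join(unaccented.translate(table).split())


print(clean_diagnosis("Cáncer de hígado (maligno)."))
```

Replacing punctuation with spaces rather than deleting it outright is a design choice made here so that tokens such as "HIGADO,MALIGNO" are not fused into a single word.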
In a second step, these diagnoses are processed to introduce a structure that can provide more information to the classifier, using the idea of semantic category, in which the words that may occur in a diagnosis are separated into thematic axes, in the style of SNOMED® [9]. For this work, and in agreement with the classification system used in ICD-10, four axes have been considered: Pathological Function (PF), Idea or Concept (IC), Spatial Concept (SC) and Anatomical Structure (AS). In this way the word CARCINOMA is related explicitly in the data with words such as SARCOMA or TUMOR, since they all belong to the PF axis, and never with words like HIGADO, which is in the AS axis. Table 1 shows examples of the 1,019 different words contained in those axes.
The process of separating words into thematic axes is done automatically. Each word is submitted to the UMLS® Semantic Network to determine its semantic category. Given this initial location, the network is navigated towards more abstract concepts until one of the four categories of Table 1 is found. Figure 2 presents two examples of this process for the words HIGADO (liver) and TUMOR. The former is assigned the category Body Part, Organ, or Organ Component by the UMLS® Semantic Network. Travelling up in the network, the concept Anatomical Structure is found and thus this word is assigned to the AS axis. Similarly, the word TUMOR is found under the category Neoplastic Function, which is related to the more abstract concept of Pathological Function, and the PF axis is assigned to this word.
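The upward navigation described above can be sketched as a walk over a parent map. The dictionaries below are hypothetical and heavily simplified: they contain only the two links named in the text, collapse any intermediate nodes, and do not query the real UMLS Semantic Network.

```python
from typing import Optional

# Hypothetical, simplified fragment of the hierarchy (child -> parent).
PARENT = {
    "Body Part, Organ, or Organ Component": "Anatomical Structure",
    "Neoplastic Function": "Pathological Function",
}

# The four thematic axes considered in this work.
AXES = {
    "Pathological Function": "PF",
    "Idea or Concept": "IC",
    "Spatial Concept": "SC",
    "Anatomical Structure": "AS",
}

# Hypothetical initial semantic category of each word.
WORD_CATEGORY = {
    "HIGADO": "Body Part, Organ, or Organ Component",
    "TUMOR": "Neoplastic Function",
}


def axis_of(word: str) -> Optional[str]:
    """Climb from the word's category until one of the four axes is reached."""
    node = WORD_CATEGORY.get(word)
    seen = set()  # guards against cycles in a malformed hierarchy
    while node is not None and node not in seen:
        if node in AXES:
            return AXES[node]
        seen.add(node)
        node = PARENT.get(node)
    return None  # the word is unknown or has no ancestor among the axes


print(axis_of("HIGADO"), axis_of("TUMOR"))
```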
Although the separation of words into thematic axes reduces part of the ambiguity found in the medical diagnoses in natural language, it does not solve the use of different words to refer to the same concept.
Consider the examples in the first column of Table 2. These phrases, apparently different, refer to the same diagnosis. This fact is evident if the equivalences CANCER ≈ NEOPLASIA MALIGNA, TUMOR ≈ NEOPLASIA, MALIGNO ≈ MALIGNA and HEPATICO ≈ HEPATICA ≈ DE HIGADO are considered. (Spanish is a language with grammatical gender: in this example, the variations MALIGNO/MALIGNA and HEPATICO/HEPATICA are used for masculine/feminine nouns. TUMOR is a masculine noun and NEOPLASIA is a feminine one.)
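One way to make such equivalences visible to a classifier is to map each word to shared concept codes, which is the role played in this work by the lexicon of [12]. The sketch below is only illustrative: the codes for CANCER, TUMOR and NEOPLASIA, and the meaning of <9> (tumour) and <27> (liver), are those quoted in the text, whereas the remaining entries and the choice to drop words absent from the lexicon are assumptions.

```python
from typing import Dict, FrozenSet, Tuple

# Toy lexicon. Entries marked "assumed" are not given in the text.
LEXICON: Dict[str, Tuple[int, ...]] = {
    "CANCER": (9, 11),
    "TUMOR": (9,),
    "NEOPLASIA": (9,),
    "MALIGNO": (11,),   # assumed
    "MALIGNA": (11,),   # assumed
    "HIGADO": (27,),    # assumed word-level entry for the liver concept
    "HEPATICO": (27,),  # assumed
    "HEPATICA": (27,),  # assumed
}


def encode(diagnosis: str) -> FrozenSet[int]:
    """Return the set of concept codes for the words found in the lexicon."""
    codes = set()
    for word in diagnosis.split():
        # Words without an entry (e.g. DE) are ignored in this sketch.
        codes.update(LEXICON.get(word, ()))
    return frozenset(codes)


for text in ("CANCER DE HIGADO", "TUMOR MALIGNO HEPATICO", "NEOPLASIA MALIGNA HEPATICA"):
    print(text, sorted(encode(text)))
```

Under these assumed entries, the three surface forms collapse to the same set of codes, which is the effect the equivalences above are meant to capture.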

Encoding Text
Consequently, there is a third step in which words are replaced by numbers according to the lexicon manually built in [12] from a medical terminology dictionary. The numbers used in the lexicon do not represent meaningful relations. Table 3 shows some entries included in this lexicon. The process of assigning numbers to words according to the lexicon, which will be referred to as word encoding, was also carried out automatically.
Using these codes as preprocessing, the words CANCER, TUMOR and NEOPLASIA are replaced by the numbers <9, 11>, <9> and <9> respectively, so that the classifier can detect that the concept <9> is present in the three diagnoses. In fact, all three diagnoses of Table 2 contain the codes <9> (tumour) and <27> (liver), making their similarity more evident.

Table 3. Examples of entries contained in the manually built lexicon of [12].

Machine Learning Methods
From a theoretical standpoint, assigning codes to pieces of text in a controlled vocabulary system can be seen as two different Natural Language Processing (NLP) tasks: Text Categorisation or Automatic Translation. Both problems are actively investigated around the world, and countless techniques and domains have been studied. This work takes the Text Categorisation perspective. This task can be seen as the assignment of a truth value to every diagnosis-code pair, where the codes must be taken from a predefined and finite set of labels. However, the task here is slightly different from the usual Text Categorisation tasks found in the literature. Previous work normally considers hundreds of thousands of fairly large documents, each containing several paragraphs of free text, that have to be classified into a small number of categories.
In contrast, this study considers each diagnosis as a document, which normally is not even a complete sentence, and the number of categories is given by the ICD-10 system, which defines more than 12,000 four-character codes.
Research in Text Categorisation has shifted in recent decades from the traditional NLP Knowledge Engineering paradigm, in which rules encoding expert knowledge are manually constructed, to the Machine Learning paradigm, in which an inductive algorithm automatically builds a text classifier by learning the patterns that associate documents and categories from a set of preclassified examples.
Most inductive machine learning approaches have been successfully applied to text classification [13]. They can be grouped into three main paradigms: rule induction, probabilistic modelling and numerical optimisation. Considering that there is not enough research on the classification of medical diagnoses to make a priori decisions, three different algorithms are tested in this work, each representing one of these learning paradigms.

Decision List
Induction of decision rules of the if-then form provides a learning method that is expressive and easy for human beings to read. In this work the Ripper 2.5 algorithm [14] is used, which learns propositional rules efficiently, even from large sets of noisy data, with a performance similar to that of more highly developed induction methods such as C4.5.

Maximum Entropy Models
A maximum entropy model (MEM) is a conditional probability distribution whose parameters are adjusted to represent the training data exactly by means of characteristic functions [15]. Among all the probabilistic models that fulfil this condition, the approach forces the selection of the one that has the maximum entropy. Therefore, the model does not make assumptions that are not supported by known information. To obtain and evaluate the maximum entropy models presented here, the MaxEnt 2.1.0 library [16] has been used.
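For reference, the standard form of a conditional maximum entropy model can be written as below. The notation is an assumption introduced here (d for a diagnosis, c for a code, f_i for the characteristic functions, λ_i for their weights, and a tilde for empirical quantities); it is the textbook formulation rather than a transcription of the notation in [15].

```latex
% Conditional maximum entropy model over codes c given a diagnosis d
p_\lambda(c \mid d) = \frac{1}{Z_\lambda(d)} \exp\Big( \sum_i \lambda_i f_i(d, c) \Big),
\qquad
Z_\lambda(d) = \sum_{c'} \exp\Big( \sum_i \lambda_i f_i(d, c') \Big)

% Training constraint: model expectations match the empirical ones
\sum_{d, c} \tilde{p}(d)\, p_\lambda(c \mid d)\, f_i(d, c)
  = \sum_{d, c} \tilde{p}(d, c)\, f_i(d, c)
\quad \text{for every } i
```

The constraint is what "representing the training data" amounts to in this formulation, and among all distributions satisfying it the exponential form above is the one with maximum conditional entropy.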

Support Vector Machines
The method proposed by Fan, Pai-Hsuen Chen and Chih-Jen Lin [17], as implemented in the LIBSVM tool [18], is used to create the Support Vector Machine (SVM) model.

The 2-fold cross-validation used in the experiments allows verifying that the division of the data is not generating biased results.
The Ripper implementation used to obtain the decision list generates and optimises a set of rules from the training set provided. In order to compare the results of Ripper fairly with the other machine learning methods, SVM classifiers and MEM classifiers have also been trained and parameterised from the data in the training subset only, by using 10-fold cross-validation. The final optimal parameters are reported for each case.
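Parameter selection by 10-fold cross-validation on the training subset can be sketched as follows. The helper names (`kfold_indices`, `select_parameters`, `score`) are hypothetical, the folds are contiguous (so the data is assumed to be shuffled beforehand), and the grid in the example reuses the cutoff and iteration ranges reported for the maximum entropy models.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Split = Tuple[List[int], List[int]]


def kfold_indices(n: int, k: int) -> List[Split]:
    """Split range(n) into k contiguous folds; return (train, held_out) pairs."""
    folds = [list(range(i * n // k, (i + 1) * n // k)) for i in range(k)]
    splits = []
    for i, held_out in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, held_out))
    return splits


def select_parameters(
    n_examples: int,
    candidates: Sequence[tuple],
    score: Callable[[tuple, List[int], List[int]], float],
    k: int = 10,
) -> tuple:
    """Return the candidate with the highest mean held-out score over k folds."""
    splits = kfold_indices(n_examples, k)

    def mean_score(params: tuple) -> float:
        return sum(score(params, tr, ho) for tr, ho in splits) / len(splits)

    return max(candidates, key=mean_score)


# Grid for the maximum entropy models: cutoff 1..3, iterations 100..600 step 100.
MEM_GRID = list(product(range(1, 4), range(100, 601, 100)))
print(len(MEM_GRID))
```

Here `score` stands for whatever routine trains a model with the given parameters on the `train` indices and returns its accuracy on the `held_out` indices; it is left abstract because it depends on the learning library in use.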
Ripper is used with most of its parameters set to default values, except that negative tests are allowed (-!s) and the algorithm is instructed to assume the data is noise-free (-c). Ripper has a useful feature: it can handle set-valued attributes, that is, attributes whose value is a set of strings. Thus Ripper can build rules of the form "if the string s occurs in S then …", where S is a set-valued attribute. Therefore, when the data has W or Z preprocessing, a single set-valued attribute is used to model the data. When the data is separated into thematic axes, four set-valued attributes are used.
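The kind of rule that set-valued attributes and negative tests make possible can be illustrated with a small decision list. This sketch only shows how such rules are applied; it does not implement Ripper's learning procedure, and the single rule shown is hypothetical, not one learned in the experiments.

```python
from typing import FrozenSet, Iterable, List, Tuple

# (strings that must occur in S, strings that must not occur in S, code)
Rule = Tuple[FrozenSet[str], FrozenSet[str], str]

# Hypothetical rule: "if HIGADO occurs in S and BENIGNO does not, then C22.9".
RULES: List[Rule] = [
    (frozenset({"HIGADO"}), frozenset({"BENIGNO"}), "C22.9"),
]


def classify(words: Iterable[str], rules: List[Rule] = RULES,
             default: str = "UNCLASSIFIED") -> str:
    """Apply the rules in order; the first rule whose tests all hold wins."""
    present = set(words)  # the set-valued attribute S
    for required, forbidden, code in rules:
        if required <= present and not (forbidden & present):
            return code
    return default


print(classify("TUMOR MALIGNO DE HIGADO".split()))
```

With thematic axes, the same idea would use four such sets per diagnosis (one per axis) instead of one.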
In these experiments, the Generalized Iterative Scaling algorithm [19] has been used to train the maximum entropy models. This algorithm requires two parameters: the number of times an instantiated characteristic function must be seen in order to be considered in the model (cutoff) and the number of times the training procedure should be repeated (iterations). The maximum entropy model chosen in each experiment is the one with the best performance in a 10-fold cross-validation over the training data among the set of 18 models that result from combining a cutoff that varies from 1 to 3 with a number of iterations that varies from 100 to 600 in increments of 100. The characteristic functions used are atomic, of the form:
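A typical atomic characteristic function in maximum entropy text classifiers, given here as an assumption rather than as the exact definition used in the experiments, pairs a single word w with a single code c:

```latex
f_{w,c}(d, c') =
\begin{cases}
1 & \text{if } w \text{ occurs in diagnosis } d \text{ and } c' = c, \\
0 & \text{otherwise.}
\end{cases}
```

Under this reading, the cutoff parameter discards any pair (w, c) that is observed fewer than the given number of times in the training data.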

Results
Models were built on a common desktop computer with 1MB of memory. Ripper models could be trained in a few minutes. MEM and SVM models took longer as they were parameterised with a 10-fold cross-validation, though no model required more than 1 hour to be completed. All methods are very fast to apply, and the testing corpus was completely classified in a few seconds by each model.
Table 5 shows the performance of the three machine learning methods for all experiments carried out.The fourth column shows the parameters used in each measurement, resulting from the 10-fold cross-validation on the training corpus.
The main observation derived from Table 5 is that all the algorithms can generate robust classifiers: all of them obtain accuracy greater than 80% with at least one of the preprocessings. This result is validated by the χ² test, which indicates that there is no significant difference between the classifiers based on the same paradigm when trained with corpus A or with corpus B (p ≥ 0.24 in all cases).
It can also be seen in this table that the different preprocessings have an important impact on the classifiers. In effect, each learning algorithm (namely Ripper, LIBSVM and MaxEnt) generates eight different classifiers (four preprocessings times two training corpora). Comparing the performance of the different classifiers generated by each algorithm, the McNemar test strongly indicates that 31 out of the 36 pairs present statistically significant differences.
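The comparison above can be sketched in two parts. The count of 36 is reproduced under the assumption that pairs are formed between preprocessings within the same algorithm and training corpus (6 pairs × 2 corpora × 3 algorithms). The function `mcnemar` is a generic implementation of McNemar's test with continuity correction, not the exact procedure or software used in the study.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

PREPROCESSINGS = ["W", "W+X", "Z", "Z+X"]
ALGORITHMS = ["Ripper", "LIBSVM", "MaxEnt"]
CORPORA = ["A", "B"]

# Pairs of preprocessings compared within each (algorithm, corpus) setting.
N_PAIRS = len(list(combinations(PREPROCESSINGS, 2))) * len(CORPORA) * len(ALGORITHMS)


def mcnemar(gold: Sequence, pred_a: Sequence, pred_b: Sequence) -> Tuple[float, float]:
    """Return (statistic, p-value) of McNemar's test with continuity correction."""
    # b: examples only classifier A gets right; c: only classifier B gets right.
    b = sum(1 for g, x, y in zip(gold, pred_a, pred_b) if x == g and y != g)
    c = sum(1 for g, x, y in zip(gold, pred_a, pred_b) if x != g and y == g)
    if b + c == 0:
        return 0.0, 1.0  # the classifiers never disagree in correctness
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


print(N_PAIRS)
```

Because the test looks only at the examples on which the two classifiers disagree, two classifiers with similar accuracy can still differ significantly if they misclassify different diagnoses.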
Moreover, the SVM classifiers are especially affected by the preprocessing schemes, as all versions show significant differences between them in terms of both accuracy and the examples misclassified (p = 0.00 in all cases). Separating words into thematic axes (preprocessing W+X) does not contribute to the classifiers based on Ripper and MEM, which do not present a statistically significant difference in accuracy with respect to preprocessing W, but it negatively affects the ones based on SVM. To some extent this was expected because the preprocessing schemes that consider thematic axes introduce numerical variability in the input vectors, making it more difficult for this kind of classifier to obtain an optimised separation.
The statistical tests also indicate that coding words (preprocessing Z) cannot be fully exploited by the Ripper algorithm (p ≥ 0.05 when compared with the corresponding Ripper-based classifier using preprocessing W), but it helps the SVM-based and MEM-based classifiers to obtain better performance (p ≤ 0.01 when tested against preprocessing W). This indicates that the reduction in the size of the input vectors can be exploited by the SVM method and by the probability model built by the MEM algorithm, but it cannot be captured completely by the few hundred rules generated by the Ripper algorithm.
Separating encoded words into thematic axes (preprocessing Z+X) does yield an improvement in the accuracy obtained by the Ripper-based and MEM-based classifiers (p = 0.00 against preprocessing Z in all cases), but it significantly worsens the performance of the classifiers based on SVM (p = 0.00 against preprocessing Z in all cases). This suggests that the reduction of lexical variability obtained through preprocessing Z is made more apparent to the classifiers when combined with the separation into thematic axes. The drop in performance of the SVM classifiers with this preprocessing seems to be more related to the corresponding representation of the input, which now consists of vectors of integers instead of binary vectors, than to the generalisation ability of the method itself.

Related Work
It is difficult to compare these results with previous studies, as most of them use natural language processing techniques more extensively, mainly because the problem is oriented towards identifying clinical information of interest in complete medical reports. The coding is done at a later stage with techniques as simple as string matching and look-up tables, or as complex as expert systems and Bayesian networks [20][21][22].
There has been some work that makes use of machine learning with methods such as k-nearest neighbour, decision lists, decision trees, and naïve Bayes classifiers [23][24].Although these attempts have been relatively successful at this task, most of them are not sufficiently reliable to replace human codifiers.
The work of Franz, Zaiss, Schulz, Hahn and Klar [21] is the closest to the one being presented here.They also attempted to codify, in ICD-9, sentences in German that represent medical diagnoses, in contrast with the other approaches that process text that is not so restricted.Franz, Zaiss, Schulz, Hahn and Klar [21] also evaluate three methods: the first is based on the similarity of all trigrams contained in the diagnosis; the second and third methods are based on the application of a morphological segmentation process and then they look up each term in SNOMED ® , and they differ in the technique for recovering the corresponding codes.
Franz, Zaiss, Schulz, Hahn and Klar report between 31% and 41% accuracy in the assignment of complete ICD-9 codes, far less than the performance achieved in this study.However, part of this difference is explained by the fact that Franz, Zaiss, Schulz, Hahn and Klar used actual diagnoses, as written by physicians, whereas the diagnoses used here were derived from controlled languages.

Conclusions and Future Work
This study has successfully obtained two trainable approaches that automatically classified medical diagnoses in natural language with 90% accuracy.This performance is achieved when the words in each diagnosis are replaced with concept codes (preprocessing Z) and separated into thematic axes (preprocessing X).
This is an important contribution, as these classifiers could constitute the core of a computer-assisted clinical coding system, which would undoubtedly reduce the time invested in the task. Indeed, the role of the human coder would be mainly the verification of the code assigned by the automatic system. Only when this code is wrongly selected would the human coder have to look for the appropriate one.
Moreover, as one of the successful approaches is based on probabilistic models (MEM), an ordered ranking of possible codes for a diagnosis in natural language can be obtained. This feature might be exploited to build a computerised (sub-)system that would allow primary codification, that is, one in which the person responsible for assigning the right code is the physician making the diagnosis. Only when the correct code is not included in the list of most probable codes would the codification need to be secondary, in which case a human coder has to interpret the diagnosis written by the physician. Such an application would considerably reduce the time dedicated to this task by doctors, one of the main disadvantages of primary codification [25], whilst avoiding the consistency problems found in secondary codification [26].
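The ranking idea can be sketched as follows, assuming the probabilistic classifier exposes a probability per code. The function name `top_k_codes` and the probability values are hypothetical and only illustrate how a short list could be offered to the physician.

```python
from typing import Dict, List, Tuple


def top_k_codes(probabilities: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most probable codes, highest probability first."""
    # Ties are broken alphabetically by code so that the output is deterministic.
    ranked = sorted(probabilities.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical model output for one diagnosis (values invented for illustration).
example = {"C22.9": 0.70, "C22.0": 0.20, "C78.7": 0.08, "D13.4": 0.02}
print(top_k_codes(example, k=2))
```

The value of k trades the length of the list the physician has to read against the chance that the correct code is missing from it.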
One limitation of this study is that the encouraging results reported in this work are achieved with a corpus of diagnoses obtained from controlled languages.Although a decrease in the performance of the methods studied can be expected when evaluated with real diagnoses written by physicians, it is unlikely that this drop in performance will be so significant as to remove the advantage obtained with respect to previous methods.
There are several ways in which this research can be continued in future work.A point to be improved is that the preprocessing Z uses a vocabulary of synonyms built manually, which has two disadvantages.Firstly, it is difficult to build and maintain a complete dictionary and therefore the approach could be missing some relevant information.Indeed the presence of this kind of noise in the data has been noticed.Secondly, the portability of the approach is affected because in order to extend its functionality to another medical field -other than neoplasms-a new dictionary must be created.Morales reported that this lexicon took 30 days to be built [12].
Therefore, an important task to be carried out is to make the acquisition of this dictionary automatic. This requires, at least, work on detecting morphological variations, word segmentation and identification of synonymous terms. Future versions of UMLS® might implement, for Spanish, the lexicographic tools that are available for the English language, making the automation of this preprocessing easier.
Another potential difficulty that must be addressed is the presence of typographical errors, acronyms and abbreviations in the diagnostic text. A preprocessing step aimed at correcting or expanding these tokens could be necessary before a diagnosis is presented to any classifier.
Finally, the fact that different combinations of preprocessing/learning algorithm misclassify different diagnoses strongly suggests that a combination of the classifiers could yield an improvement in accuracy.
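One simple way to combine classifiers, shown here only as an illustration of the idea and not as a scheme evaluated in this work, is majority voting over the codes proposed by each classifier. The function name and the tie-breaking rule are assumptions.

```python
from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the most frequent code; ties go to the earliest classifier."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    # Scan in classifier order so that ties are resolved deterministically.
    for code in predictions:
        if counts[code] == best:
            return code
    raise AssertionError("unreachable: some code always has the best count")


# Hypothetical outputs of three classifiers for the same diagnosis.
print(majority_vote(["C22.9", "C22.0", "C22.9"]))
```

Ordering the classifiers from most to least accurate would make the tie-breaking rule favour the strongest one.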

Acknowledgement
This work has been funded by DICYT grant 2070718 from Universidad de Santiago de Chile (Usach).

Table 1. Examples of words contained in each thematic axis.

Table 2. Examples of two linguistic preprocessings applied to different versions of the same diagnosis.

Table 4. Examples of the two linguistic preprocessings of Table 2 when separation into thematic axes is applied.
The LIBSVM tool [18] is utilised to create the Support Vector Machine (SVM) model. Its training method uses Sequential Minimal Optimisation to decompose the kernel function matrix in order to solve a simple two-variable optimisation problem at each iteration. The Gaussian Radial Basis Function (RBF) kernel has been selected for the experiments.
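For reference, the Gaussian RBF kernel has the standard form below, where x_i and x_j are input vectors and γ > 0 is the kernel width parameter. The symbols are introduced here for illustration; the value of γ used in the experiments is not stated in this passage.

```latex
K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left( -\gamma \, \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^{2} \right),
\qquad \gamma > 0
```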
The corpus is divided into two parts, corpus A and corpus B (a.k.a. 2-fold cross-validation). Thus each experiment is conducted twice: the first time, a classifier trained only with the data contained in corpus A is obtained and corpus B is used to evaluate it; the second time, training is done with corpus B only and corpus A is used for testing it.
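The two runs described above can be sketched as follows. The `train` and `accuracy` arguments are placeholders for any learner; the majority-class learner and the tiny corpora are hypothetical and serve only to make the example runnable.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (diagnosis text, code)


def two_fold_evaluate(corpus_a: List[Example], corpus_b: List[Example],
                      train: Callable, accuracy: Callable) -> Dict[str, float]:
    """Train on A and test on B, then train on B and test on A."""
    model_a = train(corpus_a)
    model_b = train(corpus_b)
    return {
        "train A / test B": accuracy(model_a, corpus_b),
        "train B / test A": accuracy(model_b, corpus_a),
    }


def train_majority(corpus: List[Example]) -> str:
    """Toy learner: always predict the most frequent code of the training data."""
    return Counter(code for _, code in corpus).most_common(1)[0][0]


def accuracy_majority(model: str, corpus: List[Example]) -> float:
    """Fraction of examples whose code equals the toy model's single prediction."""
    return sum(code == model for _, code in corpus) / len(corpus)


corpus_a = [("x", "C1"), ("y", "C1"), ("z", "C2")]
corpus_b = [("u", "C1"), ("v", "C2"), ("w", "C2")]
print(two_fold_evaluate(corpus_a, corpus_b, train_majority, accuracy_majority))
```

Comparing the two figures gives a quick check on whether the particular split of the data favours one direction of the experiment.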

Table 5. Results of each experiment in terms of accuracy.