<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>0718-3305</journal-id>
<journal-title><![CDATA[Ingeniare. Revista chilena de ingeniería]]></journal-title>
<abbrev-journal-title><![CDATA[Ingeniare. Rev. chil. ing.]]></abbrev-journal-title>
<issn>0718-3305</issn>
<publisher>
<publisher-name><![CDATA[Universidad de Tarapacá. Escuela Universitaria de Ingeniería Electrica - Electrónica]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S0718-33052011000300006</article-id>
<article-id pub-id-type="doi">10.4067/S0718-33052011000300006</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[Empirical evaluation of three machine learning method for automatic classification of neoplastic diagnoses]]></article-title>
<article-title xml:lang="es"><![CDATA[Evaluación empírica de tres métodos de aprendizaje automático para clasificar automáticamente diagnósticos de neoplasias]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Jara]]></surname>
<given-names><![CDATA[José Luis]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Chacón]]></surname>
<given-names><![CDATA[Max]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Zelaya]]></surname>
<given-names><![CDATA[Gonzalo]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
</contrib-group>
<aff id="A01">
<institution><![CDATA[Universidad de Santiago de Chile, Departamento de Ingeniería Informática]]></institution>
<addr-line><![CDATA[Santiago ]]></addr-line>
<country>Chile</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>12</month>
<year>2011</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>12</month>
<year>2011</year>
</pub-date>
<volume>19</volume>
<numero>3</numero>
<fpage>359</fpage>
<lpage>368</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.cl/scielo.php?script=sci_arttext&amp;pid=S0718-33052011000300006&amp;lng=en&amp;nrm=iso&amp;tlng=en"></self-uri><self-uri xlink:href="http://www.scielo.cl/scielo.php?script=sci_abstract&amp;pid=S0718-33052011000300006&amp;lng=en&amp;nrm=iso&amp;tlng=en"></self-uri><self-uri xlink:href="http://www.scielo.cl/scielo.php?script=sci_pdf&amp;pid=S0718-33052011000300006&amp;lng=en&amp;nrm=iso&amp;tlng=en"></self-uri><abstract abstract-type="short" xml:lang="es"><p><![CDATA[Los diagnósticos médicos son una fuente valiosa de información para evaluar el funcionamiento de un sistema de salud. Sin embargo, su utilización en sistemas de información se ve dificultada porque éstos se encuentran normalmente escritos en lenguaje natural. Este trabajo evalúa empíricamente tres métodos de Aprendizaje Automático para asignar códigos de acuerdo a la Clasificación Internacional de Enfermedades (décima versión) a 3.335 diferentes diagnósticos de neoplasias extraídos desde UMLS®. Esta evaluación se realiza con tres tipos distintos de preprocesamiento. Los resultados son alentadores: un conocido método de inducción de reglas de decisión y modelos de entropía máxima obtienen alrededor de 90% accuracy en una validación cruzada balanceada.]]></p></abstract>
<abstract abstract-type="short" xml:lang="en"><p><![CDATA[Diagnoses are a valuable source of information for evaluating a health system. However, they are not used extensively by information systems because diagnoses are normally written in natural language. This work empirically evaluates three machine learning methods to automatically assign codes from the International Classification of Diseases (10th Revision) to 3,335 distinct diagnoses of neoplasms obtained from UMLS®. This evaluation is conducted on three different types of preprocessing. The results are encouraging: a well-known rule induction method and maximum entropy models achieve 90% accuracy in a balanced cross-validation experiment.]]></p></abstract>
<kwd-group>
<kwd lng="es"><![CDATA[Codificación clínica]]></kwd>
<kwd lng="es"><![CDATA[vocabulario controlado]]></kwd>
<kwd lng="es"><![CDATA[clasificación internacional de enfermedades]]></kwd>
<kwd lng="es"><![CDATA[aprendizaje por máquina]]></kwd>
<kwd lng="es"><![CDATA[procesamiento de lenguaje natural]]></kwd>
<kwd lng="en"><![CDATA[Clinical coding]]></kwd>
<kwd lng="en"><![CDATA[controlled vocabulary]]></kwd>
<kwd lng="en"><![CDATA[international classification of diseases]]></kwd>
<kwd lng="en"><![CDATA[machine learning]]></kwd>
<kwd lng="en"><![CDATA[natural language processing]]></kwd>
</kwd-group>
</article-meta>
</front><body><![CDATA[  	    <p align="justify"><font face="verdana" size="2">Ingeniare. Revista chilena de ingenier&iacute;a, vol. 19 N&ordm; 3, 2011, pp. 359&#45;368</font></p> 	    <p align="right"><strong><font size="2" face="verdana">ART&Iacute;CULOS</font></strong></p> 	    <p align="left"><font face="verdana" size="4"><b>Empirical evaluation of three machine learning method for automatic classification of neoplastic diagnoses</b></font></p> 	    <p align="left">&nbsp;</p> 	    <p align="left"><strong><font face="verdana" size="3"><i>Evaluaci&oacute;n emp&iacute;rica de tres m&eacute;todos de aprendizaje autom&aacute;tico para clasificar autom&aacute;ticamente diagn&oacute;sticos de neoplasias</i></font></strong></p> 	    <p align="left">&nbsp;</p> 	    <p align="left"><strong><font face="verdana" size="2">Jos&eacute; Luis Jara<sup>1</sup> Max Chac&oacute;n<sup>1</sup> Gonzalo Zelaya<sup>1</sup></font></strong></p> 	    <p align="left">&nbsp;</p> 	    <p align="left"><font face="verdana" size="2"><sup>1</sup>Universidad de Santiago de Chile. Departamento de Ingenier&iacute;a Inform&aacute;tica. Avda. Ecuador 3659, 9170124 Estaci&oacute;n Central, Santiago, Chile. Email: <a href="mailto:jljara@usach.cl">jljara@usach.cl</a>; <a href="mailto:max.chacon@usach.cl">max.chacon@usach.cl</a>; <a href="mailto:gonzalo.zelaya@gmail.com">gonzalo.zelaya@gmail.com</a></font></p> 	<hr align="left" width="100%" size="1" noshade> 	    ]]></body>
<body><![CDATA[<p align="left"><font face="verdana" size="2"><b>RESUMEN</b></font></p>  	    <p align="left"><font face="verdana" size="2">Los diagn&oacute;sticos m&eacute;dicos son una fuente valiosa de informaci&oacute;n para evaluar el funcionamiento de un sistema de salud. Sin embargo, su utilizaci&oacute;n en sistemas de informaci&oacute;n se ve dificultada porque &eacute;stos se encuentran normalmente escritos en lenguaje natural. Este trabajo eval&uacute;a emp&iacute;ricamente tres m&eacute;todos de Aprendizaje Autom&aacute;tico para asignar c&oacute;digos de acuerdo a la Clasificaci&oacute;n Internacional de Enfermedades &#40;d&eacute;cima versi&oacute;n&#41; a 3.335 diferentes diagn&oacute;sticos de neoplasias extra&iacute;dos desde UMLS&reg;. Esta evaluaci&oacute;n se realiza con tres tipos distintos de preprocesamiento. Los resultados son alentadores: un conocido m&eacute;todo de inducci&oacute;n de reglas de decisi&oacute;n y modelos de entrop&iacute;a m&aacute;xima obtienen alrededor de 90&#37; accuracy en una validaci&oacute;n cruzada balanceada.</font></p>  	    <p align="left"><font face="verdana" size="2"><strong>Palabras claves:</strong> Codificaci&oacute;n cl&iacute;nica, vocabulario controlado, clasificaci&oacute;n internacional de enfermedades, aprendizaje por m&aacute;quina, procesamiento de lenguaje natural.</font></p> 	<hr align="left" width="100%" size="1" noshade> 	    <p align="left"><font face="verdana" size="2"><b><i>ABSTRACT</i></b></font></p> 	    <p align="left"><font face="verdana" size="2"><i>Diagnoses are a valuable source of information for evaluating a health system. However, they are not used extensively by information systems because diagnoses are normally written in natural language. This work empirically evaluates three machine learning methods to automatically assign codes from the International Classification of Diseases &#40;10th Revision&#41; to 3,335 distinct diagnoses of neoplasms obtained from UMLS<sup>&reg;</sup>. 
This evaluation is conducted on three different types of preprocessing. The results are encouraging: a well&#45;known rule induction method and maximum entropy models achieve about 90&#37; accuracy in a balanced cross&#45;validation experiment.</i></font></p>  	    <p align="left"><font face="verdana" size="2"><i><strong>Keywords:</strong> Clinical coding, controlled vocabulary, international classification of diseases, machine learning, natural language processing.</i></font></p> 	<hr align="left" width="100%" size="1" noshade> 	    <p align="left"><font face="verdana" size="3"><b>INTRODUCTION</b></font></p> 	    <p align="left"><font face="verdana" size="2">Technology Assessment in Health Care &#40;TAHC&#41; considerably improves decision making in patient care, allowing greater efficiency in the use of resources and improving people's quality of life &#91;1&#93;. Evaluating medical technologies gives decision&#45;making authorities grounds for judging the convenience of using, diffusing or accepting </font><font face="verdana" size="2">certain technologies. It also provides information to physicians and patients on the proper use of some technologies in specific health problems, and it orients hospitals on the most adequate solutions in terms of cost and effectiveness. TAHC is especially important in developing countries, which are normally consumers of technology and where health resources are more limited.</font></p>  	    <p align="left"><font face="verdana" size="2">One of the main difficulties of TAHC is that it requires Risk Adjustment, which is the general term for "accounting for patient&#45;related factors before comparing outcomes of care" &#91;2&#93;. In this analysis, &#34;risk&#34; does not correspond only to risk of death, but to a wider concept that falls into three broad areas: clinical outcomes of care &#40;e.g. death, normal vision, etc.&#41;, resources used &#40;e.g. length of stay&#41; and patient&#45;centred outcomes &#40;e.g. 
satisfaction on care preferences&#41;.</font></p>  	    <p align="left"><font face="verdana" size="2">The use of Risk Adjustment for measuring both efficiency and efficacy has recently acquired great relevance, and it is even beginning to be considered in the calculation of insurance payments, the assignment of public resources, and the evaluation of health personnel &#91;2, 3&#93;. However, it has been found that current models for the estimation of Risk Adjustment have problems because they do not include complete and reliable diagnostic information. For example, studies in Chile have determined that 51&#37; of the variation of a patient's stay in an intensive care unit can be attributed to the diagnosis and its morbidities &#91;4&#93;, and that the prediction of mortality improves by 75&#37; when these variables are added to those considered by the APACHE method, which is the most widely used physiological index of seriousness internationally &#91;5&#93;. This has led physicians and health service administrators throughout the world to promote improvements in the processes of capturing the diagnostic information of their patients &#91;6&#93;.</font></p>  	    ]]></body>
<body><![CDATA[<p align="left"><font face="verdana" size="2">In medicine, language is a valuable representation and communication tool that can be used at all levels of the health system, affecting each of those levels according to its meaning. One of the main measures for evaluating the operation of the system is the diagnostic hypothesis or admission diagnosis, which includes the measure of the seriousness and complication of the patients' condition &#91;4&#93;.</font></p>  	    <p align="left"><font face="verdana" size="2">Having coded diagnoses is necessary not only to evaluate the seriousness of the patients' condition and get descriptive statistics, but also to evaluate the effectiveness of the medical intervention and to generate predictive models of the operation of the health system &#91;7&#93;.</font></p>  	    <p align="left"><font face="verdana" size="2">Nowadays there is a large variety of controlled languages &#91;8&#45;10&#93; that allow the standardisation of </font><font face="verdana" size="2">the process, but their large sizes and their variability make simple lexicographic searches infeasible. For this reason, until now virtually all medical coding has been done manually by people trained in both the medical field and the classification system in use, and most of the computer systems in this application exist only to support human coders &#91;11&#93;.</font></p>  	    <p align="left"><font face="verdana" size="2">As a step toward automatic coding of medical diagnoses, the performance of three different machine learning methods on classifying neoplastic diagnoses according to the International Classification of Diseases 10th Revision &#40;ICD&#45;10&#41; &#91;8&#93; is studied. 
The choice of neoplastic diagnoses has two main advantages: it allows wide&#45;range coverage of medical terminology because neoplastic alterations can occur anywhere in the body, and diagnoses coming from the field of pathology are considered to be definitive in medicine, providing the necessary templates to evaluate the system.</font></p>  	    <p align="left"><strong><font face="verdana" size="3">DATA SOURCE</font></strong></p>  	    <p align="left"><font face="verdana" size="2">The diagnosis source used in this work corresponds to the data base provided by version 2004AB of the Unified Medical Language System&reg; &#40;UMLS&reg;&#41; &#91;10&#93;. The process begins by providing UMLS&reg; with the code of each neoplastic diagnosis contained in ICD&#45;10. UMLS&reg; delivers a Concept Unique Identifier &#40;CUI&#41; for each of them, and these CUIs are then used to retrieve from the database all those diagnoses in Spanish that come from sources other than ICD&#45;10. <a href="#fig01">Figure 1</a> shows an example of this process for the ICD&#45;10 code C22.9 MALIGNANT NEOPLASM OF LIVER, UNSPECIFIED.</font><font face="verdana" size="2"><a name="fig01"></a></font></p>     <p align="center"><font face="verdana" size="2"><img src="/fbpe/img/ingeniare/v19n3/art06-fig01.jpg" width="320" height="319"></font><font face="verdana" size="2">    
<br> 	</font><font face="verdana" size="2">Figure 1. Retrieving diagnoses written in Spanish from the UMLS&reg; database. The first query obtains the CUI of the diagnosis with the C22.9 code in ICD&#45;10.</font></p> 	    <p align="left"><font face="verdana" size="2">After a data cleaning process that includes standardising the text to upper case, replacing accented vowels, and deleting punctuation signs and parentheses, 3,335 different diagnoses in natural language are obtained &#40;e.g. CANCER DE HIGADO, </font><font face="verdana" size="2">TUMOR MALIGNO DE HIGADO&#41;. This corpus </font><font face="verdana" size="2">does not contain the original ICD&#45;10 diagnoses. In the example of <a href="#fig01">Figure 1</a>, the diagnoses come from the Spanish versions of the Medical Dictionary for Regulatory Activities Terminology &#40;MedDRA&#41;, the Medical Subject Headings &#40;MeSH&#41; and the World Health Organisation Adverse Drug Reaction </font><font face="verdana" size="2">Terminology &#40;WHOART&#41;.</font></p>  	    <p align="left"><font face="verdana" size="2">In a second step, these diagnoses are processed to introduce a structure that can provide more information to the classifier using the idea of semantic category in which the words that may occur in a diagnosis are separated into thematic axes, in the style of SNOMED&reg; &#91;9&#93;. For this work, and in agreement with the classification system used in ICD&#45;10, four axes have been considered: Pathological Function &#40;PF&#41;, Idea or Concept &#40;IC&#41;, Spatial Concept &#40;SC&#41; and Anatomical Structure &#40;AS&#41;. In this way the word CARCINOMA is related explicitly in the data with words such as SARCOMA or TUMOR, since they all belong to the PF axis, and never with words like HIGADO, which is in the AS axis. <a href="#tab01">Table 1</a> shows examples of the 1,019 different words contained in those axes.</font></p> 	    ]]></body>
<body><![CDATA[<p align="center"><font face="verdana" size="2"><a name="tab01"></a>Table 1. Examples of words contained in each thematic Axis.    <br>     <img src="/fbpe/img/ingeniare/v19n3/art06-tab01.jpg" width="320" height="222">	</font></p> 	    
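<p align="left"><font face="verdana" size="2">The data cleaning step described above &#40;upper case, accent removal, punctuation removal&#41; can be sketched as follows. This is only a minimal illustration, not the procedure actually used to build the corpus: the function name clean_diagnosis is hypothetical, punctuation is replaced by a space rather than simply deleted, and the accent stripping shown here also maps &Ntilde; to N, which the text does not state.</font></p>
<pre>
```python
import string
import unicodedata


def clean_diagnosis(text: str) -> str:
    """Upper-case, strip accents, and drop punctuation (parentheses included)."""
    # Decompose accented letters into base letter plus combining mark,
    # then drop the combining marks. Assumption: this also turns N-tilde into N.
    decomposed = unicodedata.normalize("NFD", text.upper())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace every ASCII punctuation character with a space (an assumed choice).
    table = str.maketrans({ch: " " for ch in string.punctuation})
    # Collapse the whitespace left behind by the removed punctuation.
    return " ".join(no_accents.translate(table).split())
```
</pre>
<p align="left"><font face="verdana" size="2">With this sketch, the input "C&aacute;ncer de h&iacute;gado &#40;maligno&#41;." becomes CANCER DE HIGADO MALIGNO.</font></p>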
<p align="left"><font face="verdana" size="2">The process of separating words into thematic axes is done automatically. Each word is subjected to the UMLS&reg; Semantic Network to determine its semantic category. Given this initial location, the network is navigated towards more abstract concepts until one of the four categories of <a href="#tab01">Table 1</a> is found. <a href="#fig02">Figure 2</a> presents two examples of this process </font><font face="verdana" size="2">for the words HIGADO &#40;liver&#41; and TUMOR. The </font><font face="verdana" size="2">former is assigned the category Body Part, Organ, or Organ Component by UMLS&reg; Semantic Network. Travelling up in the network, the concept Anatomical </font><font face="verdana" size="2">Structure is found and thus this word is assigned to the AS axis. Similarly the word TUMOR is found under the category Neoplastic Function which is related with the more abstract concept of Pathological Function and the PF axis is assigned to this word.</font><font face="verdana" size="2"><a name="fig02"></a></font></p>     <p align="center"><img src="/fbpe/img/ingeniare/v19n3/art06-fig02.jpg" width="320" height="165">    
<br>     <font face="verdana" size="2">Figure 2. Determining the semantic axis of the words higado and tumor.</font></p> 	    <p align="left"><font face="verdana" size="2">Although the separation of words into thematic axes reduces part of the ambiguity found in the medical diagnoses in natural language, it does not solve the use of different words to refer to the same concept. Consider the examples in the first column of <a href="#tab02">Table 2</a>. These phrases, apparently different, refer to the same diagnosis. This fact is evident if the equivalences CANCER &harr; NEOPLASIA MALIGNA, TUMOR &harr; NEOPLASIA, MALIGNO &harr; MALIGNA and HEPATICO &harr; HEPATICA &harr; DE HIGADO<sup><a name="n02"></a><a href="#nota02">2</a></sup> are considered.</font></p> 	    <p align="center"><font face="verdana" size="2"><a name="tab02"></a>Table 2. Examples of two linguistic preprocessings applied to different versions of the same diagnosis.    <br>     <img src="/fbpe/img/ingeniare/v19n3/art06-tab02.jpg" width="250" height="282">	</font></p>  	    
<p align="left"><font face="verdana" size="2">Consequently there is a third step in which words are replaced by numbers according to the lexicon manually built in &#91;12&#93; from a medical terminology dictionary. The numbers used in the lexicon do not represent meaningful relations. <a href="#tab03">Table 3</a> shows some entries included in this lexicon. The process of assigning numbers to words according to the lexicon, which will be referred to as word encoding, was also carried out automatically.</font></p> 	    <p align="center"><font face="verdana" size="2"><a name="tab03"></a>Table 3. Examples of entries contained in the manually built lexicon of &#91;12&#93;.    ]]></body>
<body><![CDATA[<br>     <img src="/fbpe/img/ingeniare/v19n3/art06-tab03.jpg" width="320" height="204">	</font></p> 	    
<p align="left"><font face="verdana" size="2">Using these codes as preprocessing, the words CANCER, TUMOR and NEOPLASIA are replaced by the numbers &lt;9, 11&gt;, &lt;9&gt; and &lt;9&gt; respectively, so that the classifier could detect that the concept &lt;9&gt; is present in the three diagnoses. In fact, all three diagnoses of <a href="#tab02">Table 2</a> contain the codes &lt;9&gt; &#40;tumour&#41; and &lt;27&gt; &#40;liver&#41;, making more evident their similarity.</font></p>  	    <p align="left"><font face="verdana" size="2">Thus four sets of data have been obtained for the experiments: diagnoses in words &#40;W&#41;, diagnoses in words separated into thematic axes &#40;W+X&#41;, diagnoses with encoded words &#40;Z&#41;, and diagnoses with encoded words separated into thematic axes &#40;Z+X&#41;. <a href="#tab04">Table 4</a> presents the example diagnoses of <a href="#tab02">Table 2</a> when the thematic axes are considered.</font></p>  	    <p align="center"><font face="verdana" size="2"><a name="tab04"></a>Table 4. Examples of the two linguistic preprocessings of <a href="#tab02">Table 2</a> when separation into thematic axes is applied.    <br>     <img src="/fbpe/img/ingeniare/v19n3/art06-tab04.jpg" width="320" height="381">	</font></p> 	    
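<p align="left"><font face="verdana" size="2">The word encoding and the separation into thematic axes can be illustrated with a small sketch. Only the codes for CANCER, TUMOR and NEOPLASIA and the meaning of codes 9 and 27 are taken from the text; every other lexicon entry, the word&#45;to&#45;axis map, the function names, and the decision to skip unknown words are assumptions made for illustration and do not reproduce the lexicon of &#91;12&#93;.</font></p>
<pre>
```python
# Toy lexicon. CANCER, TUMOR and NEOPLASIA follow the text; the rest is assumed.
LEXICON = {
    "CANCER": [9, 11],
    "TUMOR": [9],
    "NEOPLASIA": [9],
    "MALIGNO": [11],   # assumed
    "MALIGNA": [11],   # assumed
    "HIGADO": [27],    # assumed word for code 27 (liver)
    "HEPATICO": [27],  # assumed
}

# Hypothetical word-to-axis map over the four axes named in the text.
AXIS = {
    "CANCER": "PF", "TUMOR": "PF", "NEOPLASIA": "PF",
    "MALIGNO": "IC", "MALIGNA": "IC",
    "HIGADO": "AS", "HEPATICO": "AS",
}


def encode_words(diagnosis: str) -> list[int]:
    """Preprocessing Z: replace each known word by its lexicon codes."""
    codes = []
    for word in diagnosis.split():
        for code in LEXICON.get(word, []):
            if code not in codes:  # keep each concept once, in order of appearance
                codes.append(code)
    return codes


def split_into_axes(diagnosis: str, encode: bool = False) -> dict[str, list]:
    """Preprocessing W+X (encode=False) or Z+X (encode=True)."""
    axes = {"PF": [], "IC": [], "SC": [], "AS": []}
    for word in diagnosis.split():
        axis = AXIS.get(word)
        if axis is None:
            continue  # words missing from the toy maps (e.g. DE) are skipped
        items = LEXICON.get(word, []) if encode else [word]
        for item in items:
            if item not in axes[axis]:
                axes[axis].append(item)
    return axes
```
</pre>
<p align="left"><font face="verdana" size="2">Under these assumptions, CANCER DE HIGADO and TUMOR MALIGNO DE HIGADO both encode to the codes 9, 11 and 27, which is the kind of similarity the encoding is meant to expose to the classifier.</font></p>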
<p align="left"><font face="verdana" size="3"><b>MACHINE LEARNING METHODS</b></font></p> 	    <p align="left"><font face="verdana" size="2">From a theoretical standpoint, assigning codes to pieces of text in a controlled vocabulary system can be seen as two different Natural Language Processing &#40;NLP&#41; tasks: Categorisation of Text, or Automatic Translation. Both problems are investigated actively around the world, and countless techniques and domains have been studied. This work takes the Text Categorisation perspective. This task can be seen as the assignment of a truth value to every diagnosis&#45;code pair, where the codes must be taken from a predefined and finite set of labels. However, the task here is slightly different from the usual Text Categorisation tasks that can be found in the literature. Previous work normally considers hundreds of thousands of fairly large documents &#45;containing several paragraphs of free text&#45; that have to be classified into a small number of categories. 
In contrast, this study considers each diagnosis as a </font><font face="verdana" size="2">document &#45;which normally is not even a complete sentence&#45; and the number of categories is given by the ICD&#45;10 system which defines more than 12,000 four&#45;character codes.</font></p>  	    <p align="left"><font face="verdana" size="2">Research in Text Categorisation has shifted in the last decades from the traditional NLP Knowledge Engineering paradigm, in which rules encoding expert knowledge are manually constructed, in favour of the Machine Learning paradigm in which an inductive algorithm automatically builds a text classifier by learning the patterns that associate documents and the categories from a set of pre&#45;classified examples.</font></p>  	    <p align="left"><font face="verdana" size="2">Most inductive machine learning approaches have been successfully applied to text classification &#91;13&#93;; they can be grouped into three main paradigms: rule induction, probabilistic modelling and numerical optimisation. Considering that there is not enough research on classification of medical diagnoses to make a priori decisions, in this work three different algorithms are tested so that each method represents one of the above learning paradigms.</font></p>  	    <p align="left"><font face="verdana" size="2"><b>Decision List    ]]></body>
<body><![CDATA[<br> 	</b></font><font face="verdana" size="2">Induction of decision rules of the form if&#45;then provides a learning method that is expressive and easy to read by human beings. In this work the Ripper 2.5 algorithm &#91;14&#93; is used, which learns propositional rules efficiently, even from large sets of noisy data, with a performance similar to that of more highly developed induction methods such as C4.5.</font></p>  	    <p align="left"><font face="verdana" size="2"><b>Maximum Entropy Models    <br> 	</b></font><font face="verdana" size="2">A maximum entropy model &#40;MEM&#41; is a conditional probability distribution that adjusts its parameters to represent perfectly the training data by means of characteristic functions &#91;15&#93;. From all the probabilistic models that fulfil this condition, the approach forces the selection of the one that has the maximum entropy. Therefore, the model does not make assumptions that are not supported by known information. To obtain and evaluate the maximum entropy models presented here, the MaxEnt 2.1.0 library &#91;16&#93; has been used.</font></p>  	    <p align="left"><font face="verdana" size="2"><b>Support Vector Machines    <br> 	</b></font><font face="verdana" size="2">The method proposed by Fan, Pai&#45;Hsuen Chen and Chih&#45;Jen Lin &#91;17&#93;, and implemented in the LIBSVM </font><font face="verdana" size="2">2.82 tool &#91;18&#93;, is utilised to create the Support Vector Machine &#40;SVM&#41; model. This method uses Sequential Minimal Optimisation to decompose the kernel function matrix in order to solve a simple two&#45;variable optimisation problem at each iteration. 
The Gaussian Radial Basis Function &#40;RBF&#41; kernel has been selected for the experiments.</font></p>  	    <p align="left"><font face="verdana" size="3"><b>METHOD</b></font></p>  	    <p align="left"><font face="verdana" size="2">For the experiments, the corpus was randomly divided into two disjoint subsets trying to balance the number of examples for each class &#40;i.e. each ICD&#45;10 code&#41;. In this way two non&#45;overlapping, annotated corpora are obtained, labelled A and B, with the purpose of carrying out a balanced cross&#45;validation &#40;a.k.a. 2&#45;fold cross&#45;validation&#41;. Thus each experiment is conducted twice: the first time, a classifier trained only with the data contained in corpus A is obtained and corpus B is used to evaluate it; the second time, training is done with corpus B only and corpus A is used for testing it. This cross&#45;validation allows the verification that the division of the data for the experiments is not generating biased results.</font></p>  	    <p align="left"><font face="verdana" size="2">The Ripper implementation used to obtain the decision list generates and optimises a set of rules from the training set provided. In order to fairly compare the results of Ripper with the other machine learning methods, SVM classifiers and MEM classifiers have also been trained and parameterised from the data in the training subset only by using 10&#45;fold cross&#45;validation. The final optimal parameters are reported for each case.</font></p>  	    <p align="left"><font face="verdana" size="2">Ripper is used with most of its parameters set to default values, except that negative tests are allowed <i>&#40;&#45;!s&#41;</i> and the algorithm is instructed to assume the data is noise&#45;free &#40;&#45;c&#41;. Ripper has a nice feature: it can handle set&#45;valued attributes, that is, attributes whose value is a set of strings. 
Thus Ripper can build rules of the form "if the string <i>s</i> occurs in <i>S</i> then ...", where <i>S</i> is a <i>set&#45;valued attribute.</i> Therefore when the data has W or Z preprocessing, a single set&#45;valued attribute is used to model the data. When the data is separated into thematic axes, four set&#45;valued attributes are used.</font></p>  	    <p align="left"><font face="verdana" size="2">In these experiments, the Generalized Iterative Scaling algorithm &#91;19&#93; has been used to train the </font><font face="verdana" size="2">maximum entropy models. This algorithm requires two parameters: the number of times an instantiated characteristic function must be seen in order to be considered in the model <i>&#40;cutoff&#41;</i> and the number of times the training procedure should be repeated <i>&#40;iterations&#41;.</i> The maximum entropy model chosen in each experiment corresponds to that having the best performance in a 10&#45;fold cross&#45;validation over the training data among the set of 18 models that result from training with a cutoff that varies from 1 to 3 and subjecting the training from 100 to 600 iterations in increments of 100. The characteristic functions used are atomic of the form:</font></p>  	    ]]></body>
<body><![CDATA[<p align="left"><font face="verdana" size="2"><img src="/fbpe/img/ingeniare/v19n3/art06-fxy.jpg" width="260" height="37"></font></p>  	    
<p align="left"><font face="verdana" size="2">When the data is separated into thematic axes, the characteristic functions consider this information and are triggered only if they belong to the axis of interest.</font></p>  	    <p align="left"><font face="verdana" size="2">LIBSVM provides a visual tool for searching for the best parameters for the model. In this case only two parameters must be adjusted: the cost that controls the proportion of misclassification allowed during training &#40;c&#41; and the width of the Gaussian RBF kernel &#40;T&#41;. The parameters reported in the experiments are those that yield the best performance in a 10&#45;fold cross&#45;validation on the training corpus after two search processes: after a broad initial search, a more focused search around the best initial parameters was performed. The data with W or Z preprocessing are represented as a binary input vector for LIBSVM in which the <i>i</i>th position of the vector has a value of 1 if the <i>i</i>th word &#45;or word code&#45; is present in the diagnosis, and a value of 0 otherwise. For the data with W+X or Z+X preprocessing a non&#45;binary input vector is used in which each axis has a number of positions equal to the maximum number of words that occur simultaneously in a diagnosis. Each position is filled with an integer number that represents the word or word code. Unused positions are filled with a value of 0.</font></p>  	    <p align="left"><font face="verdana" size="2">The performance of each classifier is measured in terms of accuracy, rather than the more classical recall and precision, because the corpus contains positive examples only. Additionally, two statistical tests are used to evaluate the significance of the results. On the one hand, the differences in accuracy &#45;i.e. considering only the proportion of examples misclassified&#45; are assessed with the &#967;<sup>2</sup> test for equality of distributions. 
On the other hand, the non&#45;parametric McNemar test is applied to determine whether differences in the examples wrongly classified are significant. In all tests, a 5&#37; nominal level &#40;p &lt; 0.05&#41; is considered significant.</font></p>  	    <p align="left"><font face="verdana" size="3"><b>RESULTS</b></font></p>  	    <p align="left"><font face="verdana" size="2">Models were built on a common desktop computer with 1MB of memory. Ripper models could be trained in a few minutes. MEM and SVM models took longer as they were parameterised with a 10&#45;fold cross&#45;validation, though no model required more than 1 hour to be completed. All methods are very fast to apply, and the testing corpus was completely classified in a few seconds by each model.</font></p>  	    <p align="left"><font face="verdana" size="2"><a href="#tab05">Table 5</a> shows the performance of the three machine learning methods for all experiments carried out. The fourth column shows the parameters used in each measurement, resulting from the 10&#45;fold cross&#45;validation on the training corpus.</font></p> 	    <p align="center"><font face="verdana" size="2"><a name="tab05"></a>Table 5. Results of each experiment in terms of accuracy.    <br>     <img src="/fbpe/img/ingeniare/v19n3/art06-tab05.jpg" width="570" height="485">	</font></p> 	    
<p align="left"><font face="verdana" size="2">The main observation derived from <a href="#tab05">Table 5</a> is that all the algorithms can generate robust classifiers: all of them obtain accuracy greater than 80&#37; with at least one of the preprocessings. This result is validated by the &#967;<sup>2</sup> test, which indicates that there is no significant difference between the classifiers based on the same paradigm when trained with corpus A or with corpus B &#40;p &gt; 0.24 in all cases&#41;.</font></p>  	    ]]></body>
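<body><![CDATA[<p align="left"><font face="verdana" size="2">The McNemar test named in the METHOD section compares two classifiers on the same test examples by looking only at the examples that exactly one of them classifies correctly. The following is a minimal sketch, assuming the common continuity&#45;corrected form of the statistic and a &#967;<sup>2</sup> approximation with one degree of freedom; the text does not say which variant was used, and the function name mcnemar_test is hypothetical.</font></p>
<pre>
```python
import math


def mcnemar_test(correct_a: list[bool], correct_b: list[bool]) -> tuple[float, float]:
    """Return (statistic, p_value) of McNemar's test with continuity correction.

    correct_a[i] and correct_b[i] say whether classifiers A and B labelled
    test example i correctly. The correction is an assumption of this sketch.
    """
    # Only discordant examples (exactly one classifier is right) carry information.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    discordant = only_a + only_b
    if discordant == 0:
        return 0.0, 1.0  # identical error patterns, nothing to compare
    statistic = max(abs(only_a - only_b) - 1, 0) ** 2 / discordant
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value
```
</pre>
<p align="left"><font face="verdana" size="2">A p&#45;value below the 5&#37; nominal level would then be read as a significant difference in the examples misclassified by the two classifiers.</font></p>]]></body>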
<body><![CDATA[<p align="left"><font face="verdana" size="2">It can also be seen in this table that the different preprocessing schemes have an important impact on the classifiers. In effect, each learning algorithm &#45;namely Ripper, LIBSVM and MaxEnt&#45; generates eight different classifiers &#40;four preprocessing schemes times two training corpora&#41;. Comparing the performance of the different classifiers generated by each algorithm, the McNemar test strongly indicates that 31 out of the 36 pairs present statistically significant differences.</font></p>  	    <p align="left"><font face="verdana" size="2">Moreover, the SVM classifiers are especially affected by the preprocessing schemes, as all versions show significant differences between them in terms of both accuracy and the examples misclassified &#40;p = 0.00 in all cases&#41;. Separating words into thematic axes &#40;preprocessing W+X&#41; does not contribute to </font><font face="verdana" size="2">the classifiers based on Ripper and MEM &#45;which do not present a statistically significant difference in accuracy with preprocessing W&#45; but it negatively affects the ones based on SVM. To some extent this was expected because the preprocessing schemes that consider thematic axes introduce numerical variability in the input vectors, making it more difficult for this kind of classifier to obtain an optimised separation.</font></p>  	    <p align="left"><font face="verdana" size="2">The statistical tests also indicate that coding words &#40;preprocessing Z&#41; cannot be fully exploited by the Ripper algorithm <i>&#40;p</i> &gt; 0.05 when compared with the corresponding Ripper&#45;based classifier using preprocessing W&#41;, but it helps the SVM&#45;based and MEM&#45;based classifiers to obtain better performance &#40;p &lt; 0.01 when tested against preprocessing W&#41;. 
This indicates that the reduction in the size of the input vectors can be exploited by the SVM method and the </font><font face="verdana" size="2">probability model built by the MEM algorithm, but it cannot be captured completely by the few hundred rules generated by the Ripper algorithm.</font></p>  	    <p align="left"><font face="verdana" size="2">Separating encoded words into thematic axes &#40;preprocessing Z+X&#41; does yield an improvement in the accuracy obtained by the Ripper&#45;based and MEM&#45;based classifiers <i>&#40;p =</i> 0.00 against preprocessing Z in all cases&#41;, but it significantly worsens the performance of the classifiers based on SVM <i>&#40;p</i> = 0.00 against preprocessing Z in all cases&#41;. This suggests that the reduction of lexical variability obtained through preprocessing Z is made more apparent to the classifiers when combined with the separation into thematic axes. The drop in performance of the SVM classifiers with this preprocessing seems to be more related to the corresponding representation of the input &#45;which is now a vector of integers instead of a binary vector&#45; than to their generalisation ability.</font></p>  	    <p align="left"><font face="verdana" size="2">Consequently, the best classifications are obtained with preprocessing Z+X by the models based on Ripper and MEM, which do not present a statistically significant difference between them &#40;p &gt; 0.22 in both tests&#41;.</font></p>  	    <p align="left"><font face="verdana" size="3"><b>RELATED WORK</b></font></p>  	    <p align="left"><font face="verdana" size="2">It is difficult to compare these results with previous studies as most of them use natural language processing techniques more extensively, mainly because the problem is oriented towards identifying clinical information of interest in complete medical reports. 
The coding is done at a later stage with techniques as simple as string matching and look&#45;up tables, or as complex as expert systems and Bayesian </font><font face="verdana" size="2">networks &#91;20&#45;22&#93;.</font></p>  	    <p align="left"><font face="verdana" size="2">There has been some work that makes use of machine learning with methods such as <i>k</i>&#45;nearest neighbour, decision lists, decision trees, and naive Bayes classifiers &#91;23&#45;24&#93;. Although these attempts have been relatively successful at this task, most of them are not sufficiently reliable to replace human codifiers.</font></p>  	    <p align="left"><font face="verdana" size="2">The work of Franz, Zaiss, Schulz, Hahn and Klar &#91;21&#93; is the closest to the one presented here. They also attempted to codify, in ICD&#45;9, sentences in German that represent medical diagnoses, in contrast with the other approaches, which process text that is not so restricted. Franz, Zaiss, Schulz, Hahn and Klar &#91;21&#93; also evaluate three methods: the first is based on the similarity of all trigrams contained in the diagnosis; the second and third methods apply a morphological segmentation process and then look up each term in SNOMED&reg;, and they differ in the technique for recovering the corresponding codes.</font></p>  	    <p align="left"><font face="verdana" size="2">Franz, Zaiss, Schulz, Hahn and Klar report between 31&#37; and 41&#37; accuracy in the assignment of complete ICD&#45;9 codes, far less than the performance achieved in this study. However, part of this difference is explained by the fact that Franz, Zaiss, Schulz, Hahn and Klar used actual diagnoses, as written by physicians, whereas the diagnoses used here were derived from controlled languages.</font></p>  	    ]]></body>
<body><![CDATA[<p align="left"><font face="verdana" size="3"><b>CONCLUSIONS AND FUTURE WORK</b></font></p> 	    <p align="left"><font face="verdana" size="2">This study has successfully obtained two trainable approaches that automatically classify medical diagnoses in natural language with 90&#37; accuracy. This performance is achieved when the words in each diagnosis are replaced with concept codes &#40;preprocessing Z&#41; and separated into thematic axes &#40;preprocessing X&#41;.</font></p>  	    <p align="left"><font face="verdana" size="2">This is an important contribution as these classifiers could constitute the core of a computer&#45;assisted clinical coding system, which would undoubtedly reduce the time invested in the task. Indeed, the role of the human coder will be mainly the verification of the code assigned by the automatic system. Only when this code is wrongly selected will the human coder have to look for the appropriate one.</font></p>  	    <p align="left"><font face="verdana" size="2">Moreover, as one of the successful approaches is based on probabilistic models &#40;MEM&#41;, an ordered ranking of possible codes for a diagnosis in natural language can be obtained. This feature might be exploited to build a computerised &#40;sub&#45;&#41; system that would allow <i>primary codification,</i> that is, the person responsible for assigning the right code is the physician making the diagnosis. Only when the correct code is not included in the list of most probable codes does the codification need to be <i>secondary,</i> in which a human coder has to <i>interpret</i> the diagnosis written by the physician. 
Such an application would considerably reduce the time dedicated to this task by doctors, one of the main disadvantages of primary codification &#91;25&#93;, whilst avoiding the consistency problems found in secondary codification &#91;26&#93;.</font></p>  	    <p align="left"><font face="verdana" size="2">One limitation of this study is that the encouraging results reported in this work are achieved with a corpus of diagnoses obtained from controlled languages. Although a decrease in the performance of the methods studied can be expected when evaluated with real diagnoses written by physicians, it is unlikely that this drop in performance will be so significant as to remove the advantage obtained with respect to previous methods.</font></p>  	    <p align="left"><font face="verdana" size="2">There are several ways in which this research can be continued in future work. A point to be improved is that preprocessing Z uses a vocabulary of synonyms built manually, which has two disadvantages. Firstly, it is difficult to build </font><font face="verdana" size="2">and maintain a complete dictionary and therefore the approach could be missing some relevant information. Indeed, the presence of this kind of noise in the data has been noticed. Secondly, the portability of the approach is affected because, in order to extend its functionality to another medical field &#45;other than neoplasms&#45;, a new dictionary must be created. Morales reported that this lexicon took </font><font face="verdana" size="2">30 days to be built &#91;12&#93;.</font></p>  	    <p align="left"><font face="verdana" size="2">Therefore, an important job to be carried out is to make the acquisition of this dictionary automatic. This requires, at least, work on detecting morphological variations, word segmentation and identification of synonymous terms. 
Future versions of UMLS<sup>&reg;</sup> might implement, for Spanish, the lexicographic tools that are available for the English language, making the automation of this preprocessing easier.</font></p>  	    <p align="left"><font face="verdana" size="2">Another potential difficulty that must be addressed is the presence of typographical errors, acronyms and abbreviations in the diagnostic text. A preprocessing step aimed at correcting or expanding these tokens could be necessary before a diagnosis is presented to any classifier.</font></p>  	    <p align="left"><font face="verdana" size="2">Finally, the fact that different combinations of preprocessing/learning algorithm misclassify different diagnoses strongly suggests that a combination of the classifiers could yield an improvement in accuracy.</font></p>     <p align="left"><font face="verdana" size="3"><b>ACKNOWLEDGEMENT</b></font></p> 	    ]]></body>
<body><![CDATA[<p align="left"><font face="verdana" size="2">This work has been funded by DICYT grant 2070718 from Universidad de Santiago de Chile &#40;Usach&#41;.</font></p> 	    <p align="left"><strong><font size="3" face="Verdana">NOTES</font></strong></p> 	    <p align="left"><font face="verdana" size="2"><sup><a name="nota02"></a><a href="#n02">2</a></sup>Spanish is a language with grammatical gender. In this example, the variations MALIGNO/MALIGNA and HEPATICO/ HEPATICA are used for masculine/feminine nouns. TUMOR is a masculine noun and NEOPLASIA is a feminine one.</font></p>     <p align="left"><font face="verdana" size="3"><b>REFERENCES</b></font></p>  	    <!-- ref --><p align="left"><font face="verdana" size="2">&#91;1&#93; R.B. Panerai and J. Pe&ntilde;a Mohr. &#34;Evaluaci&oacute;n de tecnolog&iacute;as en salud: Metodolog&iacute;a para pa&iacute;ses en desarrollo&#34;. Organizaci&oacute;n Panamericana de la Salud. Washington D.C., U.S.A. 1990.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600001&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p>  	    <!-- ref --><p align="left"><font face="verdana" size="2">&#91;2&#93; L.I. Iezzoni, editor. &#34;Risk Adjustment for Measuring Health Care Outcomes&#34;. Third edition. Health Administration Press. Chicago, </font><font face="verdana" size="2">Illinois, U.S.A. 2003.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600002&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p>  	    <!-- ref --><p align="left"><font face="verdana" size="2">&#91;3&#93; A. Majeed, A.B. 
Bindman and J.P. Weiner. &#34;Use of risk adjustment in setting budgets and measuring performance in primary care I: how it works&#34;. British Medical Journal. Vol. 323, pp. 604&#45;607. September, 2001.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600003&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p>  	    ]]></body>
<body><![CDATA[<!-- ref --><p align="left"><font face="verdana" size="2">&#91;4&#93; M. Chac&oacute;n, V. Rocco, E. Morgado, E. S&aacute;ez y S. Pliscoff. &#34;Identificaci&oacute;n de los determinantes de la estad&iacute;a en Unidades de Cuidados Intensivos usando redes neuronales artificiales&#34;. Revista M&eacute;dica de Chile. Vol. 130 N&deg; 1, pp. 71&#45;78. January, 2002.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600004&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p>  	    <!-- ref --><p align="left"><font face="verdana" size="2">&#91;5&#93; M. Chac&oacute;n and O. Luci. &#34;Patients classification by risk using cluster analysis and genetic algorithms&#34;. In Alberto Sanfeliu and Jos&eacute; Ruiz&#45;Shulcloper, editors, Progress in Pattern Recognition, Speech and Image Analysis, Proceedings 8th Iberoamerican Congress on Pattern </font><font face="verdana" size="2">Recognition, CIARP 2003, Havana, Cuba, </font><font face="verdana" size="2">November 26&#45;29, 2003, Lecture Notes in Computer Science, volume 2905, pages 350&#45;358. Springer, February, 2003.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600005&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p>  	    <!-- ref --><p align="left"><font face="verdana" size="2">&#91;6&#93; A. Majeed, A.B. Bindman and J.P. Weiner. &#34;Use of risk adjustment in setting budgets and measuring performance in primary care ii: advantages, disadvantages, and practicalities&#34;. British Medical Journal. Vol. 323, pp. 607&#45;610. September, 2001.    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600006&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p>  	    <!-- ref --><p align="left"><font face="verdana" size="2">&#91;7&#93; L.I. Iezzoni, J.Z. Ayanian, D.W. Bates and H.R. Burstin. &#34;Paying more fairly for Medicare capitated care&#34;. The New England Journal of Medicine. Vol. 339 N&deg; 26. 1998.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600007&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p>  	    <!-- ref --><p align="left"><font face="verdana" size="2">&#91;8&#93; OPS. &#34;Clasificaci&oacute;n estad&iacute;stica internacional de enfermedades y problemas relacionados con la salud&#34;. Publicaci&oacute;n Cient&iacute;fica. Vol. 1 N&deg; 554. D&eacute;cima Revisi&oacute;n. Organizaci&oacute;n Panamericana de la Salud. Washington, </font><font face="verdana" size="2">D.C., U.S.A. 1995.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600008&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p>  	    ]]></body>
<body><![CDATA[<!-- ref --><p align="left"><font face="verdana" size="2">&#91;9&#93; R.A. C&ocirc;t&eacute;, D.J. Rothwell, J.L. Palotay, R.S. Beckett and L. Brochu, editors. &#34;The Systemised Nomenclature of Medicine: SNOMED International&#34;. College of American Pathologists, Northfield, Illinois, </font><font face="verdana" size="2">U.S.A. 1993.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600009&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p>  	    <!-- ref --><p align="left"><font face="verdana" size="2">&#91;10&#93; NLP. &#34;UMLS&reg; knowledge sources&#34;. Technical Report 15th Edition. July Release 2004AB, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, </font><font face="verdana" size="2">U.S.A. July, 2004.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600010&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p>  	    <!-- ref --><p align="left"><font face="verdana" size="2">&#91;11&#93; D.T. Heinze, M.L. Morsch, R.E. Sheffer, Jr., Michelle A. Jimmink, M.A. Jennings, W.C. Morris and A.E.W. Morsch. &#34;Lifecode: A deployed application for automated medical </font><font face="verdana" size="2">coding&#34;. AI Magazine. Vol. 22 N&deg; 2. 2001.    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600011&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p>  	    <!-- ref --><p align="left"><font face="verdana" size="2">&#91;12&#93; C. Morales. &#34;Sistema de reconocimiento para clasificaci&oacute;n autom&aacute;tica de diagn&oacute;sticos m&eacute;dicos&#34;. Trabajo de Titulaci&oacute;n para optar </font><font face="verdana" size="2">al T&iacute;tulo de Ingeniero Civil en Inform&aacute;tica. </font><font face="verdana" size="2">Abril 2002.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600012&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p>  	    <!-- ref --><p align="left"><font face="verdana" size="2">&#91;13&#93; F. Sebastiani. &#34;Machine learning in automated text categorization&#34;. ACM Computing Surveys. Vol. 34 N&deg; 1. 2002.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600013&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p> 	    ]]></body>
<body><![CDATA[<!-- ref --><p align="left"><font face="verdana" size="2">&#91;14&#93; W.W. Cohen. &#34;Fast effective rule induction&#34;. In Armand Prieditis and Stuart J. Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning &#40;ICML&#45;1995&#41;. Morgan Kaufmann. Tahoe City, </font><font face="verdana" size="2">California, U.S.A. 1995.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600014&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="left"><font face="verdana" size="2">&#91;15&#93; S.D. Pietra, V.D. Pietra and J. Lafferty. &#34;Inducing features of random fields&#34;. IEEE Transactions Pattern Analysis and Machine Intelligence. Vol. 19 N&deg; 4. 1997.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600015&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="left"><font face="verdana" size="2">&#91;16&#93; J.M. Baldrige and G. Bierner. &#34;OpenNLP MAXENT&#34;. 2001. Software available at <a href="http://maxent.sourceforge.net" target="_blank">http://maxent.sourceforge.net</a></font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600016&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --><!-- ref --><p align="left"><font face="verdana" size="2">&#91;17&#93; R.&#45;E. Fan, P.&#45;H. Chen and C.&#45;J. Lin. 
&#34;Working set selection using the second order information for training SVM&#34;. Journal of Machine Learning Research. Vol. 6. 2005.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600017&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="left"><font face="verdana" size="2">&#91;18&#93; C.&#45;C. Chang and C.&#45;J. Lin. &#34;LIBSVM: A </font><font face="verdana" size="2">Library for Support Vector Machines&#34;. 2001. Software available at <a href="http://www.csie.ntu.edu.tw/~cjlin/libsvm" target="_blank">http://www.csie.ntu.edu.tw/~cjlin/libsvm</a></font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600018&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --><!-- ref --><p align="left"><font face="verdana" size="2">&#91;19&#93; Generalized Iterative Scaling &#40;GIS&#41;. Darroch &amp; Ratcliff. 1972.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600019&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p> 	    ]]></body>
<body><![CDATA[<!-- ref --><p align="left"><font face="verdana" size="2">&#91;20&#93; L. Riddick, W.B. Long, W.S. Copes, D.M. </font><font face="verdana" size="2">Dove and W.J. Sacco. &#34;Automated coding of injuries from autopsy reports&#34;. American Journal of Forensic Medicine &amp; Pathology. Vol. 19 N&deg; 3. 1998.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600020&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="left"><font face="verdana" size="2">&#91;21&#93; P. Franz, A. Zaiss, S. Schulz, U. Hahn and R. Klar. &#34;Automated coding of diagnoses &#45; three </font><font face="verdana" size="2">methods compared&#34;. In Proceedings of the Annual Symposium of the American Medical Informatics Association &#40;AMIA&#41;. Hanley &amp; Belfus, Inc. Philadelphia, Pennsylvania, </font><font face="verdana" size="2">U.S.A. 2000.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600021&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="left"><font face="verdana" size="2">&#91;22&#93; C. Friedman, L. Shagina, Y. Lussier and G. Hripcsak. &#34;Automated encoding of clinical documents based on natural language processing&#34;. Journal of the American Medical Informatics Association. Vol. 11 N&deg; 5. 2004.    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600022&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p>  	    <!-- ref --><p align="left"><font face="verdana" size="2">&#91;23&#93; A. Wilcox and G. Hripcsak. &#34;Classification algorithms applied to narrative reports&#34;. In Proceedings of the Annual Symposium of the American Medical Informatics Association &#40;AMIA&#41;. Hanley &amp; Belfus, Inc. Philadelphia, </font><font face="verdana" size="2">Pennsylvania, U.S.A. 1999.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600023&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p>  	    <!-- ref --><p align="left"><font face="verdana" size="2">&#91;24&#93; S.V. Pakhomov, J. Buntrock and C.G. Chute. &#34;Identification of patients with congestive heart failure using a binary classifier: A case study&#34;. In S. Ananiadou and J. Tsujii, editors, Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine. Association for Computational Linguistics. </font><font face="verdana" size="2">2003.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600024&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p>  	    ]]></body>
<body><![CDATA[<!-- ref --><p align="left"><font face="verdana" size="2">&#91;25&#93; A. Tretiakov, I. Hunter, D. Whidett and E. </font><font face="verdana" size="2">Sutinen. &#34;Coding of medical records via restrictive semantic topic tracking&#34;. Health Care and Informatics Review Online. Vol. 10 </font><font face="verdana" size="2">N&deg; 3. 2007.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600025&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p>  	    <!-- ref --><p align="left"><font face="verdana" size="2">&#91;26&#93; D.T. Heinze, P. Feller, J. McCorkle and M. Morsch. &#34;Computer&#45;assisted Auditing for High&#45;Volume Medical Coding. Perspectives in Health Information Management&#34;. Computer Assisted Coding Conference Proceedings. 2006.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scieloOrg/php/reflinks.php?refpid=S0718-3305201100030000600026&pid=S0718-33052011000300006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');"></a>&#160;]<!-- end-ref --></font></p> 	<hr align="left" width="30%" size="1" noshade> 	    <p align="left"><font face="verdana" size="2"><i>Received: March 16, 2009 Accepted: November 3, 2011</i></font></p>      ]]></body><back>
<ref-list>
<ref id="B1">
<label>1</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Panerai]]></surname>
<given-names><![CDATA[R.B]]></given-names>
</name>
<name>
<surname><![CDATA[Peña Mohr]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
</person-group>
<source><![CDATA[Evaluación de tecnologías en salud: Metodología para países en desarrollo]]></source>
<year>1990</year>
<publisher-loc><![CDATA[Washington D.C. ]]></publisher-loc>
<publisher-name><![CDATA[Organización Panamericana de la Salud]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B2">
<label>2</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Iezzoni]]></surname>
<given-names><![CDATA[L.I]]></given-names>
</name>
</person-group>
<source><![CDATA[Risk Adjustment for Measuring Health Care Outcomes]]></source>
<year>2003</year>
<edition>Third</edition>
<publisher-loc><![CDATA[Chicago, Illinois]]></publisher-loc>
<publisher-name><![CDATA[Health Administration Press]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B3">
<label>3</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Majeed]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
<name>
<surname><![CDATA[Bindman]]></surname>
<given-names><![CDATA[A.B]]></given-names>
</name>
<name>
<surname><![CDATA[Weiner]]></surname>
<given-names><![CDATA[J.P]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Use of risk adjustment in setting budgets and measuring performance in primary care I: how it works]]></article-title>
<source><![CDATA[British Medical Journal]]></source>
<year>2001</year>
<month>September</month>
<volume>323</volume>
<page-range>604-607</page-range></nlm-citation>
</ref>
<ref id="B4">
<label>4</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Chacón]]></surname>
<given-names><![CDATA[M]]></given-names>
</name>
<name>
<surname><![CDATA[Rocco]]></surname>
<given-names><![CDATA[V]]></given-names>
</name>
<name>
<surname><![CDATA[Morgado]]></surname>
<given-names><![CDATA[E]]></given-names>
</name>
<name>
<surname><![CDATA[Sáez]]></surname>
<given-names><![CDATA[E]]></given-names>
</name>
<name>
<surname><![CDATA[Pliscoff]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
</person-group>
<article-title xml:lang="es"><![CDATA[Identificación de los determinantes de la estadía en Unidades de Cuidados Intensivos usando redes neuronales artificiales]]></article-title>
<source><![CDATA[Revista Médica de Chile]]></source>
<year>2002</year>
<month>January</month>
<volume>130</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>71-78</page-range></nlm-citation>
</ref>
<ref id="B5">
<label>5</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Chacón]]></surname>
<given-names><![CDATA[M]]></given-names>
</name>
<name>
<surname><![CDATA[Luci]]></surname>
<given-names><![CDATA[O]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Patients classification by risk using cluster analysis and genetic algorithms]]></article-title>
<person-group person-group-type="editor">
<name>
<surname><![CDATA[Sanfeliu]]></surname>
<given-names><![CDATA[Alberto]]></given-names>
</name>
<name>
<surname><![CDATA[Ruiz-Shulcloper]]></surname>
<given-names><![CDATA[José]]></given-names>
</name>
</person-group>
<source><![CDATA[Progress in Pattern Recognition, Speech and Image Analysis]]></source>
<year>2003</year>
<month>February</month>
<volume>2905</volume>
<conf-name><![CDATA[ Proceedings 8th Iberoamerican Congress on Pattern Recognition, CIARP 2003]]></conf-name>
<conf-date>November 26-29, 2003</conf-date>
<conf-loc>Havana </conf-loc>
<page-range>350-358</page-range><publisher-name><![CDATA[Springer]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B6">
<label>6</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Majeed]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
<name>
<surname><![CDATA[Bindman]]></surname>
<given-names><![CDATA[A.B]]></given-names>
</name>
<name>
<surname><![CDATA[Weiner]]></surname>
<given-names><![CDATA[J.P]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Use of risk adjustment in setting budgets and measuring performance in primary care ii: advantages, disadvantages, and practicalities]]></article-title>
<source><![CDATA[British Medical Journal]]></source>
<year></year>
<month>September</month>
<volume>323</volume>
<page-range>607-610</page-range></nlm-citation>
</ref>
<ref id="B7">
<label>7</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Iezzoni]]></surname>
<given-names><![CDATA[L.I]]></given-names>
</name>
<name>
<surname><![CDATA[Ayanian]]></surname>
<given-names><![CDATA[J.Z]]></given-names>
</name>
<name>
<surname><![CDATA[Bates]]></surname>
<given-names><![CDATA[D.W]]></given-names>
</name>
<name>
<surname><![CDATA[Burstin]]></surname>
<given-names><![CDATA[H.R]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Paying more fairly for Medicare capitated care]]></article-title>
<source><![CDATA[The New England Journal of Medicine]]></source>
<year>1998</year>
<volume>339</volume>
<numero>26</numero>
<issue>26</issue>
</nlm-citation>
</ref>
<ref id="B8">
<label>8</label><nlm-citation citation-type="journal">
<collab>OPS</collab>
<article-title xml:lang="es"><![CDATA[Clasificación estadística internacional de enfermedades y problemas relacionados con la salud]]></article-title>
<source><![CDATA[Publicación Científica]]></source>
<year>1995</year>
<volume>1</volume>
<numero>554</numero>
<issue>554</issue>
<publisher-loc><![CDATA[Washington, D.C.]]></publisher-loc>
<publisher-name><![CDATA[Organización Panamericana de la Salud]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B9">
<label>9</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Côté]]></surname>
<given-names><![CDATA[R.A]]></given-names>
</name>
<name>
<surname><![CDATA[Rothwell]]></surname>
<given-names><![CDATA[D.J]]></given-names>
</name>
<name>
<surname><![CDATA[Palotay]]></surname>
<given-names><![CDATA[J.L]]></given-names>
</name>
<name>
<surname><![CDATA[Beckett]]></surname>
<given-names><![CDATA[R.S]]></given-names>
</name>
<name>
<surname><![CDATA[Brochu]]></surname>
<given-names><![CDATA[L]]></given-names>
</name>
</person-group>
<source><![CDATA[The Systemised Nomenclature of Medicine: SNOMED International]]></source>
<year>1993</year>
<publisher-loc><![CDATA[Northfield, Illinois]]></publisher-loc>
<publisher-name><![CDATA[College of American Pathologists]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B10">
<label>10</label><nlm-citation citation-type="book">
<collab>NLM</collab>
<source><![CDATA[UMLS® knowledge sources: Technical Report]]></source>
<year></year>
<month>July</month>
<edition>15</edition>
<publisher-loc><![CDATA[Bethesda, MD]]></publisher-loc>
<publisher-name><![CDATA[U.S. National Library of Medicine]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B11">
<label>11</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Heinze]]></surname>
<given-names><![CDATA[D.T]]></given-names>
</name>
<name>
<surname><![CDATA[Morsch]]></surname>
<given-names><![CDATA[M.L]]></given-names>
</name>
<name>
<surname><![CDATA[Sheffer, Jr]]></surname>
<given-names><![CDATA[R.E]]></given-names>
</name>
<name>
<surname><![CDATA[Jimmink]]></surname>
<given-names><![CDATA[M.A]]></given-names>
</name>
<name>
<surname><![CDATA[Jennings]]></surname>
<given-names><![CDATA[M.A]]></given-names>
</name>
<name>
<surname><![CDATA[Morris]]></surname>
<given-names><![CDATA[W.C]]></given-names>
</name>
<name>
<surname><![CDATA[Morsch]]></surname>
<given-names><![CDATA[A.E.W]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Lifecode: A deployed application for automated medical coding]]></article-title>
<source><![CDATA[AI Magazine]]></source>
<year>2001</year>
<volume>22</volume>
<numero>2</numero>
<issue>2</issue>
</nlm-citation>
</ref>
<ref id="B12">
<label>12</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Morales]]></surname>
<given-names><![CDATA[C]]></given-names>
</name>
</person-group>
<source><![CDATA[Sistema de reconocimiento para clasificación automática de diagnósticos médicos]]></source>
<year></year>
</nlm-citation>
</ref>
<ref id="B13">
<label>13</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Sebastiani]]></surname>
<given-names><![CDATA[F]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Machine learning in automated text categorization]]></article-title>
<source><![CDATA[ACM Computing Surveys]]></source>
<year>2002</year>
<volume>34</volume>
<numero>1</numero>
<issue>1</issue>
</nlm-citation>
</ref>
<ref id="B14">
<label>14</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Cohen]]></surname>
<given-names><![CDATA[W.W]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Fast effective rule induction]]></article-title>
<person-group person-group-type="editor">
<name>
<surname><![CDATA[Prieditis]]></surname>
<given-names><![CDATA[Armand]]></given-names>
</name>
<name>
<surname><![CDATA[Russell]]></surname>
<given-names><![CDATA[Stuart J]]></given-names>
</name>
</person-group>
<source><![CDATA[]]></source>
<year>1995</year>
<conf-name><![CDATA[Proceedings of the Twelfth International Conference on Machine Learning (ICML-1995)]]></conf-name>
<conf-loc> </conf-loc>
<publisher-loc><![CDATA[Tahoe City, California]]></publisher-loc>
</nlm-citation>
</ref>
<ref id="B15">
<label>15</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Pietra]]></surname>
<given-names><![CDATA[S.D]]></given-names>
</name>
<name>
<surname><![CDATA[Pietra]]></surname>
<given-names><![CDATA[V.D]]></given-names>
</name>
<name>
<surname><![CDATA[Lafferty]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Inducing features of random fields]]></article-title>
<source><![CDATA[IEEE Transactions on Pattern Analysis and Machine Intelligence]]></source>
<year>1997</year>
<volume>19</volume>
<numero>4</numero>
<issue>4</issue>
</nlm-citation>
</ref>
<ref id="B16">
<label>16</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Baldridge]]></surname>
<given-names><![CDATA[J.M]]></given-names>
</name>
<name>
<surname><![CDATA[Bierner]]></surname>
<given-names><![CDATA[G]]></given-names>
</name>
</person-group>
<source><![CDATA[OpenNLP MAXENT]]></source>
<year>2001</year>
</nlm-citation>
</ref>
<ref id="B17">
<label>17</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Fan]]></surname>
<given-names><![CDATA[R.-E]]></given-names>
</name>
<name>
<surname><![CDATA[Chen]]></surname>
<given-names><![CDATA[P.-H]]></given-names>
</name>
<name>
<surname><![CDATA[Lin]]></surname>
<given-names><![CDATA[C.-J]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Working set selection using the second order information for training SVM]]></article-title>
<source><![CDATA[Journal of Machine Learning Research]]></source>
<year>2005</year>
<volume>6</volume>
</nlm-citation>
</ref>
<ref id="B18">
<label>18</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Chang]]></surname>
<given-names><![CDATA[C.-C]]></given-names>
</name>
<name>
<surname><![CDATA[Lin]]></surname>
<given-names><![CDATA[C.-J]]></given-names>
</name>
</person-group>
<source><![CDATA[LIBSVM: A Library for Support Vector Machines]]></source>
<year>2001</year>
</nlm-citation>
</ref>
<ref id="B19">
<label>19</label><nlm-citation citation-type="book">
<source><![CDATA[Generalized Iterative Scaling (GIS)]]></source>
<year>1972</year>
<publisher-name><![CDATA[Darroch & Ratcliff]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B20">
<label>20</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Riddick]]></surname>
<given-names><![CDATA[L]]></given-names>
</name>
<name>
<surname><![CDATA[Long]]></surname>
<given-names><![CDATA[W.B]]></given-names>
</name>
<name>
<surname><![CDATA[Copes]]></surname>
<given-names><![CDATA[W.S]]></given-names>
</name>
<name>
<surname><![CDATA[Dove]]></surname>
<given-names><![CDATA[D.M]]></given-names>
</name>
<name>
<surname><![CDATA[Sacco]]></surname>
<given-names><![CDATA[W.J]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Automated coding of injuries from autopsy reports]]></article-title>
<source><![CDATA[American Journal of Forensic Medicine & Pathology]]></source>
<year>1998</year>
<volume>19</volume>
<numero>3</numero>
<issue>3</issue>
</nlm-citation>
</ref>
<ref id="B21">
<label>21</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Franz]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
<name>
<surname><![CDATA[Zaiss]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
<name>
<surname><![CDATA[Schulz]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
<name>
<surname><![CDATA[Hahn]]></surname>
<given-names><![CDATA[U]]></given-names>
</name>
<name>
<surname><![CDATA[Klar]]></surname>
<given-names><![CDATA[R]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Automated coding of diagnoses: three methods compared]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[Proceedings of the Annual Symposium of the American Medical Informatics Association (AMIA)]]></conf-name>
<conf-date>2000</conf-date>
<conf-loc>Philadelphia, Pennsylvania</conf-loc>
</nlm-citation>
</ref>
<ref id="B22">
<label>22</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Friedman]]></surname>
<given-names><![CDATA[C]]></given-names>
</name>
<name>
<surname><![CDATA[Shagina]]></surname>
<given-names><![CDATA[L]]></given-names>
</name>
<name>
<surname><![CDATA[Lussier]]></surname>
<given-names><![CDATA[Y]]></given-names>
</name>
<name>
<surname><![CDATA[Hripcsak]]></surname>
<given-names><![CDATA[G]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Automated encoding of clinical documents based on natural language processing]]></article-title>
<source><![CDATA[Journal of the American Medical Informatics Association]]></source>
<year>2004</year>
<volume>11</volume>
<numero>5</numero>
<issue>5</issue>
</nlm-citation>
</ref>
<ref id="B23">
<label>23</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Wilcox]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
<name>
<surname><![CDATA[Hripcsak]]></surname>
<given-names><![CDATA[G]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Classification algorithms applied to narrative reports]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[Proceedings of the Annual Symposium of the American Medical Informatics Association (AMIA)]]></conf-name>
<conf-date>1999</conf-date>
<conf-loc>Philadelphia, Pennsylvania</conf-loc>
</nlm-citation>
</ref>
<ref id="B24">
<label>24</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Pakhomov]]></surname>
<given-names><![CDATA[S.V]]></given-names>
</name>
<name>
<surname><![CDATA[Buntrock]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
<name>
<surname><![CDATA[Chute]]></surname>
<given-names><![CDATA[C.G]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Identification of patients with congestive heart failure using a binary classifier: A case study]]></article-title>
<person-group person-group-type="editor">
<name>
<surname><![CDATA[Ananiadou]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
<name>
<surname><![CDATA[Tsujii]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
</person-group>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine]]></conf-name>
<conf-date>2003</conf-date>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B25">
<label>25</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Tretiakov]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
<name>
<surname><![CDATA[Hunter]]></surname>
<given-names><![CDATA[I]]></given-names>
</name>
<name>
<surname><![CDATA[Whiddett]]></surname>
<given-names><![CDATA[D]]></given-names>
</name>
<name>
<surname><![CDATA[Sutinen]]></surname>
<given-names><![CDATA[E]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Coding of medical records via restrictive semantic topic tracking]]></article-title>
<source><![CDATA[Health Care and Informatics Review Online]]></source>
<year>2007</year>
<volume>10</volume>
<numero>3</numero>
<issue>3</issue>
</nlm-citation>
</ref>
<ref id="B26">
<label>26</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Heinze]]></surname>
<given-names><![CDATA[D.T]]></given-names>
</name>
<name>
<surname><![CDATA[Feller]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
<name>
<surname><![CDATA[McCorkle]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
<name>
<surname><![CDATA[Morsch]]></surname>
<given-names><![CDATA[M]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Computer-assisted Auditing for High-Volume Medical Coding: Perspectives in Health Information Management]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[Computer Assisted Coding Conference Proceedings]]></conf-name>
<conf-date>2006</conf-date>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
</ref-list>
</back>
</article>
