SciELO - Scientific Electronic Library Online


Revista signos

On-line version ISSN 0718-0934


VENEGAS, René. Academic text classification based on lexical-semantic content. Rev. signos [online]. 2007, vol.40, n.63, pp.239-271. ISSN 0718-0934.

The aim of this research is to classify the academic texts in the PUCV-2006 Corpus, compiled as part of the Fondecyt 1060440 research project, by applying and comparing two automatic classification methods. Both methods rely on shared lexical-semantic content words found in a corpus of academic texts used in four degree programs at the Pontificia Universidad Católica de Valparaíso, Chile. The research corpus currently comprises 652 texts with 96,288,874 words. For our purposes, we use a sample of 216 texts (30,886,081 words), distributed as follows: 26 used in Construction Engineering, 31 in Chemistry, 64 in Social Work, and 95 in Psychology. The classification methods compared are Multinomial Naïve Bayes and Support Vector Machine. Both make it possible to identify a small group of shared words which, according to their statistical weights, allow a new text to be classified into one of the four disciplinary areas. The results show that Support Vector Machine classifies academic texts efficiently, with high precision and recall values. With this method we can automatically identify the disciplinary domain of a new academic text submitted as a query, with high accuracy (93.9%). We plan to use this method as part of a more detailed multidimensional analysis of the PUCV-2006 Corpus.
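
As a rough illustration of how a word-weight classifier of this kind can assign a text to a disciplinary area, the following is a minimal sketch of Multinomial Naïve Bayes with Laplace smoothing. It is not the implementation used in the study: the function names (`train_multinomial_nb`, `classify`), the smoothing parameter `alpha`, and the four one-line training "documents" are hypothetical and chosen only to make the example self-contained.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class.

    docs: list of token lists; labels: one class label per document.
    alpha: Laplace smoothing constant (an assumption of this sketch).
    """
    vocab = {word for doc in docs for word in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {}
    for label, n_class_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        denom = total_words + alpha * len(vocab)
        log_like = {
            word: math.log((word_counts[label][word] + alpha) / denom)
            for word in vocab
        }
        log_prior = math.log(n_class_docs / len(docs))
        model[label] = (log_prior, log_like)
    return model


def classify(model, doc):
    """Return the class with the highest log posterior score.

    Words never seen in training are skipped.
    """
    scores = {}
    for label, (log_prior, log_like) in model.items():
        scores[label] = log_prior + sum(
            log_like[word] for word in doc if word in log_like
        )
    return max(scores, key=scores.get)


# Hypothetical toy data: one tiny "text" per disciplinary area.
docs = [
    "concrete beam load structure".split(),
    "acid reaction molecule solution".split(),
    "community family intervention policy".split(),
    "cognition behavior therapy patient".split(),
]
labels = ["Construction Engineering", "Chemistry", "Social Work", "Psychology"]

model = train_multinomial_nb(docs, labels)
predicted = classify(model, "the patient received behavior therapy".split())
print(predicted)
```

Each content word contributes a per-class log weight, and the new text is assigned to the area whose accumulated weight is largest. A Support Vector Machine would instead learn a separating boundary over the same word-count vectors.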

Keywords: Academic discourse; vector model; Naïve Bayes; Support Vector Machine.



Creative Commons License: All content of this journal, except where otherwise noted, is licensed under a Creative Commons License.