Clustering on Narrow Domain

Free access to scientific papers in major digital libraries and other web repositories is limited to their abstracts. Clustering abstracts from very narrow domains is a challenging task that has received little attention from the computational linguistics community. The main objective of this area is to classify scientific documents; in addition, it aims to detect emerging fields of study by means of unsupervised clustering methods. It is well known that the performance of clustering methods depends on the preprocessing step applied to the corpus. A good technique for selecting a subset of the terms that appear in each scientific paper is therefore needed. However, current keyword-based techniques fail on narrow-domain libraries, because of the high overlap of terms among the abstracts and the large number of stock phrases used in them, such as "In this paper we present...". Some approaches have been proposed for this new task; they focus mainly on selecting a good technique for extracting terms from the vocabulary of each abstract. Makagonov04, for instance, proposed simple procedures for improving results through an adequate selection of keywords and a better evaluation of document similarity. Another work in this context is [Alexandrov, Gelbukh, and Rosso2005], which presents an approach for clustering abstracts in a narrow domain that uses Stein's MajorClust method to cluster both keywords and documents. Despite the small size of its collection, an interesting work is [Jiménez, Pinto, and Rosso2005b], which proposed a new technique for keyword selection; the authors also applied this technique to a larger corpus [Pinto, Jiménez-Salazar, and Rosso2006]. Their results motivated this comparative study: we are interested in verifying whether this new technique can improve the results obtained in a feature selection setting.
The remainder of this paper is organized as follows. First, we introduce the feature selection techniques used by Pinto. The next section describes the experiment we carried out, first introducing the dataset and then giving a complete description of the comparative study. Section 3 presents the experimental results, and finally a discussion of the findings is given.

David Pinto 2006-05-25