Clustering on Narrow Domain
Free access to scientific papers in major digital libraries and other web repositories is limited to their abstracts. Clustering
abstracts from very narrow domains is a challenging task that has received little attention from the computational linguistics community. The main
objective in this area is to classify scientific documents and, further, to detect emerging fields of study by means of unsupervised clustering methods. It is well known that the performance of clustering methods
depends on the preprocessing applied to the corpus. Consequently, a good technique for selecting a subset of the
terms that appear in each scientific paper is needed. However, current keyword-based techniques fail on narrow-domain libraries
because the abstracts share a large part of their vocabulary and contain many stock phrases,
such as "In this paper we present...". Some approaches have been proposed for this new task; they focus mainly on selecting
a good technique for extracting terms from the vocabulary of each abstract. Makagonov04, for instance, proposed simple
procedures for improving results through an adequate selection of keywords and a better evaluation of document similarity. Related work is presented in [Alexandrov, Gelbukh, and Rosso2005], which clusters abstracts in a narrow domain by applying Stein's MajorClust method to
both keywords and documents. Although it used a small collection, [Jiménez, Pinto, and Rosso2005b]
proposed a new technique for keyword selection; the authors later applied this technique to a larger
corpus [Pinto, Jiménez-Salazar, and Rosso2006]. Their results motivated this comparative study: we are interested in verifying whether this new technique
can improve the results obtained in a feature selection setting. The remainder of this paper is organized as follows. First, we introduce the
feature selection techniques used by Pinto. The next section describes the experiment we carried out: it introduces
the dataset and then gives a complete description of the comparative study. Section 3 presents the experimental
results, and a discussion of the findings closes the paper.
David Pinto
2006-05-25