Very short text clustering on narrow domains has received little attention from the computational linguistics community. This stems from the difficulty of the problem: the results obtained are very unstable or imprecise when clustering abstracts of scientific papers, technical reports, patents, etc. Nevertheless, most digital libraries and other web-based repositories of scientific and technical information nowadays provide free access only to abstracts, not to the full texts of the documents. Moreover, some institutions, such as the well-known CERN, receive hundreds of publications every day that must be categorized into some specific domain with an unknown number of categories. This motivates the construction of novel methods for dealing with this real-world problem.
Clustering very short texts implies dealing with very low term frequencies; moreover, when such texts come from scientific papers, the difficulty increases due to the recurrent use of phrases such as ``in this paper we present...''. As a matter of fact, in , it is said that:
When we deal with documents from one given domain, the situation is cardinally different. All clusters to be revealed have strong intersections of their vocabularies and the difference between them consists not in the set of index keywords but in their proportion. This causes very unstable and thus very imprecise results when one works with short documents, because of very low absolute frequency of occurrence of the keywords in the texts. Usually only 10% or 20% of the keywords from the complete keyword list occur in every document and their absolute frequency usually is 1 or 2, sometimes 3 or 4. In this situation, changing a keyword's frequency by 1 can significantly change the clustering results.
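The instability described above is easy to reproduce. The following sketch (with hypothetical keyword counts, not taken from the cited collections) shows how changing a single keyword frequency by 1 noticeably shifts the cosine similarity between two short documents:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse keyword-count dictionaries."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

# Two hypothetical abstracts with the typical low absolute frequencies (1-2).
a = {"cluster": 1, "keyword": 2, "corpus": 1}
b = {"cluster": 1, "similarity": 1, "corpus": 1}

before = cosine(a, b)
a["cluster"] = 2          # change one keyword frequency by 1, as the quote describes
after = cosine(a, b)
```

Here a single count change moves the similarity by more than 0.1 in absolute terms, i.e. over 20% of its original value, which in a full-text setting with frequencies in the tens would be negligible.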
Some related work was presented in , where simple procedures were proposed to improve results through an adequate selection of keywords and a better evaluation of document similarity. The authors used as corpora two collections retrieved from the Web: the first was composed of 48 abstracts (40 Kb) from the CICLing 2002 conference; the second of 200 abstracts (215 Kb) from the IFCS-2000 conference. The main goal of that paper was to stabilize results in this kind of task; differences of about 10% among the clustering methods were obtained, taking into account different degrees of domain broadness and combined measures.
In , an approach was presented for clustering abstracts in a narrow domain using Stein's MajorClust method to cluster both keywords and documents. Here, Alexandrov et al. used the criterion introduced in  to perform the word selection process. The authors based their experiments on the first CICLing collection used by Makagonov et al. , and succeeded in improving those results. In the final discussion, Alexandrov et al. stated that abstracts cannot be clustered with the same quality as full texts, though the quality achieved is adequate for many applications; moreover, they suggested that, for open access via the Internet, digital libraries should provide document images of the full texts of papers, not only abstracts.
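For readers unfamiliar with it, MajorClust iteratively lets each node of a similarity graph adopt the cluster label that accumulates the maximum total edge weight among its neighbours, until the labelling stabilizes. The sketch below is a simplified illustration of that idea, not Stein's original formulation; the update order, tie-breaking rule, and the toy graph are our own assumptions:

```python
import random

def majorclust(graph, seed=0, max_iter=100):
    """Simplified MajorClust sketch.  `graph` maps node -> {neighbour: weight}.
    Each node repeatedly adopts the label with maximum summed edge weight
    among its neighbours; a tie keeps the current label."""
    rng = random.Random(seed)
    labels = {n: n for n in graph}          # every node starts as its own cluster
    for _ in range(max_iter):
        changed = False
        nodes = list(graph)
        rng.shuffle(nodes)                  # assumed update order: random sweep
        for n in nodes:
            weight = {}
            for m, w in graph[n].items():
                weight[labels[m]] = weight.get(labels[m], 0) + w
            if not weight:
                continue
            best = max(weight, key=weight.get)
            if best != labels[n] and weight[best] > weight.get(labels[n], 0):
                labels[n] = best
                changed = True
        if not changed:
            break
    return labels

# Toy similarity graph: two dense triangles joined by one weak edge.
g = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0, "d": 0.1},
    "d": {"c": 0.1, "e": 1.0, "f": 1.0},
    "e": {"d": 1.0, "f": 1.0},
    "f": {"d": 1.0, "e": 1.0},
}
labels = majorclust(g)
```

On this toy graph the two triangles end up with distinct labels, since the weak bridge edge never outweighs the intra-group connections.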
More recently, in , a third experiment with the CICLing collection was carried out. In that paper, a novel method for keyword selection was proposed, which was claimed to improve clustering results on that collection. Jiménez-Salazar et al. compared different term selection mechanisms by using the feature selection evaluation employed in the text categorization task .
After reviewing these works, we observe that the feature selection process is the key to the task of clustering abstracts in narrow domains. Moreover, a larger collection of abstracts is needed in order to confirm previously obtained results. In the following Section we present a brief description of the Transition Point technique. The third Section describes the term selection methods used in the experiments we carried out. The fourth Section presents the data set and the performance measure formulas used. A comparison of the results obtained is presented in Section five. Finally, the conclusions of our experiments are given.