Next: Datasets Up: On the Relative Hardness Previous: On the Relative Hardness

Introduction

Clustering deals with finding structure in a collection of unlabeled data [2]. When dealing with raw text corpora, discovering the most appropriate features can help in selecting methods and techniques for determining the possible intrinsic grouping in those sets of unlabeled data, which is what motivates this study. As far as we know, hardly any research in this direction has been reported in the literature. We found just one attempt to determine the relative hardness of the Reuters-21578 clustering collection [1], but that work derived neither formulae for determining the hardness of such corpora nor the possible set of features involved in clustering hardness. Related work that could be considered in order to observe the hardness of a given corpus (with respect to a specific clustering algorithm) is partially presented in [3] and [4]. In these works, the author discusses internal clustering quality measures, such as those of the Dunn Index family, which were shown to perform well in the experiments presented by Bezdek et al. in [5,6], among others.
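For reference, the classical Dunn index, the base member of the family mentioned above, can be written as follows. This is the standard textbook form, not necessarily the exact variant used in [3,4,5,6]; the symbols $\delta$ (distance between two clusters) and $\Delta$ (diameter of a cluster) are notation assumed here for illustration.

```latex
% Classical Dunn index for a clustering C = {C_1, ..., C_k}.
% \delta(C_i, C_j): distance between clusters C_i and C_j (assumed notation)
% \Delta(C_l):      diameter of cluster C_l (assumed notation)
D(C) = \frac{\min_{1 \le i < j \le k} \delta(C_i, C_j)}
            {\max_{1 \le l \le k} \Delta(C_l)}
```

Larger values indicate compact, well-separated clusters; the generalized members of the family differ in how $\delta$ and $\Delta$ are defined.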

Reuters-21578 (now Reuters RCV1 and RCV2) and 20 Newsgroups are well-known collections which have been used for benchmarking clustering algorithms. However, the fact that several clustering methods may obtain poor results on those corpora does not necessarily imply that the corpora themselves are difficult to cluster. Further investigation is needed to determine whether the current clustering corpora are easy clustering instances or not.

We are interested in investigating two aspects: a set of possible features hypothetically related to the hardness of the clustering task, and the definition of a formula for easily evaluating the relative hardness of a given clustering corpus. We know empirically that at least three components are involved: (i) the size of the texts to be clustered, (ii) the broadness of the corpus domain, and (iii) whether the documents are single- or multi-categorized. In our preliminary experiments, we have investigated the possible relationship between the vocabulary overlap of a given text corpus and the F-Measure obtained on it with the MajorClust clustering algorithm [7].
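To make the notion of vocabulary overlap concrete, the following is a minimal sketch that averages the Jaccard overlap between the vocabularies of every pair of categories. It is only an illustration under stated assumptions: the Jaccard coefficient, the whitespace tokenizer, and the names `vocabulary`, `mean_pairwise_overlap` and `toy_corpus` are hypothetical choices made here, not the formula used in this paper, which is introduced in a later section.

```python
from itertools import combinations


def vocabulary(docs):
    """Return the set of lowercased, whitespace-separated tokens in a list of documents."""
    return {token for doc in docs for token in doc.lower().split()}


def mean_pairwise_overlap(corpus):
    """Average Jaccard overlap between the vocabularies of all category pairs.

    `corpus` maps a category name to a list of document strings.
    Jaccard is an assumption made for this sketch, not the paper's formula.
    """
    vocabs = {cat: vocabulary(docs) for cat, docs in corpus.items()}
    scores = []
    for a, b in combinations(sorted(vocabs), 2):
        union = vocabs[a] | vocabs[b]
        # Guard against two empty categories to avoid division by zero.
        scores.append(len(vocabs[a] & vocabs[b]) / len(union) if union else 0.0)
    # With fewer than two categories there are no pairs to compare.
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data, for illustration only.
toy_corpus = {
    "sports": ["the team won the match"],
    "finance": ["the bank won the bid"],
}

overlap = mean_pairwise_overlap(toy_corpus)
print(f"mean pairwise vocabulary overlap: {overlap:.3f}")
```

Under this sketch, a high value means the categories share much of their vocabulary, which one would expect to make them harder to separate; whether that expectation holds is exactly the relationship with the F-Measure that the experiments examine.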

The rest of this paper is structured as follows. In Section 2 we briefly describe the main characteristics of the corpora used in our preliminary experiments. In Section 3 we introduce the formula used and the approach employed to split the corpus in order to calculate the relative hardness for all possible combinations of two or more categories. Section 5 shows the experimental results we obtained. Finally, conclusions are drawn and the further work still to be done is discussed.


David Pinto 2007-10-05