Reuters-21578 (since followed by Reuters RCV1 and RCV2) and 20 Newsgroups are well-known collections that have been used for benchmarking clustering algorithms. However, the fact that several clustering methods obtain poor results on these corpora does not necessarily imply that the corpora are difficult to cluster. Further investigation is needed to determine whether the current clustering corpora are easy clustering instances or not.
We are interested in investigating two aspects: a set of features hypothetically related to the hardness of the clustering task, and the definition of a formula for easily evaluating the relative hardness of a given clustering corpus. We know empirically that at least three components are involved: (i) the size of the texts to be clustered, (ii) the broadness of the corpus domain, and (iii) whether the documents are single- or multi-categorized. In our preliminary experiments, we have investigated the possible relationship between the vocabulary overlap of a given text corpus and the F-Measure obtained on it with the MajorClust clustering algorithm [7].
The rest of this paper is structured as follows. In Section 2 we briefly describe the main characteristics of the corpora used in our preliminary experiments. In Section 3 we introduce the formula used and the approach employed to split the corpus in order to calculate the relative hardness for all possible combinations of two or more categories. Section 5 shows the experimental results we obtained. Finally, conclusions are drawn and further work is discussed.