On the Relative Hardness
of Clustering Corpora
David Pinto (1,2), Paolo Rosso (1)
(1) Department of Information Systems and Computation,
Polytechnic University of Valencia, Spain
(2) Faculty of Computer Science,
B. Autonomous University of Puebla, Mexico
Clustering is often considered the most important unsupervised
learning problem and several clustering
algorithms have been proposed over the years. Many of these algorithms
have been tested on classical
clustering corpora such as Reuters and 20 Newsgroups in order to
determine their quality. However, up to now the
relative hardness of those corpora has not been determined.
The relative clustering hardness of a given corpus is of considerable
interest, since it would help
to determine whether the corpora commonly used to benchmark clustering
algorithms are hard enough.
Moreover, if a set of features involved in the
hardness of the clustering task itself can be found,
specific clustering techniques may be used instead of general ones in
order to improve the quality of the obtained clusters.
In this paper, we present a study of one specific feature: the
vocabulary overlap among the documents of a given corpus. Our
preliminary experiments were carried out on three different corpora:
the training and test versions of the R8 subset of the Reuters collection
and a reduced version of 20 Newsgroups (Mini20Newsgroups).
Our results suggest a possible relation between the vocabulary
overlap and the
F-Measure.
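
As an illustration of what a vocabulary overlap feature could look like, the following minimal sketch computes the mean pairwise Jaccard coefficient between the vocabularies of the documents in a corpus. This is an assumption made for illustration only: the abstract does not define the overlap measure, and the function names (vocabulary, jaccard, mean_vocabulary_overlap) and the whitespace tokenization are hypothetical choices, not the authors' method.

```python
from itertools import combinations


def vocabulary(text):
    """Return the set of lowercase, whitespace-separated tokens of a document.

    Hypothetical tokenization; a real study would likely apply stop-word
    removal and stemming first.
    """
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_vocabulary_overlap(documents):
    """Average Jaccard coefficient over all unordered pairs of documents.

    Assumed stand-in for a corpus-level vocabulary overlap measure.
    Returns 0.0 when the corpus has fewer than two documents.
    """
    vocabs = [vocabulary(doc) for doc in documents]
    pairs = list(combinations(vocabs, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    corpus = ["oil prices rise", "oil prices fall", "team wins match"]
    print(mean_vocabulary_overlap(corpus))
```

Under this sketch, a corpus whose documents share much of their vocabulary yields a value close to 1, while a corpus with well-separated vocabularies yields a value close to 0. Such a score could then be compared against the F-Measure obtained by a clustering algorithm on the same corpus.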