Further work

As future work, we need to investigate the correlation between the relative hardness and the F-Measure on the Mini20Newsgroups dataset as well. Moreover, we are interested in evaluating both the vocabulary overlap and the term frequencies. This will allow us to investigate whether using the tf-idf formula in the same context improves the current results. In addition, we would like to investigate the possible relationship between the RH-Measure and cluster validity measures, such as the Density Expected Measure (DEM), which quantifies the similarity within clusters [9]. We also plan to quantify the correlation between the RH-Measure and the F-Measure through rank correlation coefficients such as Spearman's and Kendall's [4]. The final aim of this research is to determine the level of hardness of a narrow-domain corpus, such as hep-ex [10], from the perspective of a clustering task.
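The planned rank-correlation analysis could be sketched as follows. This is only a minimal illustration, not the procedure we will necessarily adopt: the function names are hypothetical, the Kendall variant shown is tau-a (which makes no correction for ties), and the two score lists are invented placeholders rather than experimental results.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    # Placeholder values for illustration only (one entry per corpus).
    rh_scores = [0.12, 0.35, 0.20, 0.48]
    f_scores = [0.81, 0.55, 0.60, 0.42]
    print("Spearman rho:", spearman_rho(rh_scores, f_scores))
    print("Kendall tau-a:", kendall_tau_a(rh_scores, f_scores))
```

Both coefficients depend only on the ordering of the scores, so they would not require the RH-Measure and the F-Measure to be linearly related.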

