Next: Correlation between relative hardness Up: On the Relative Hardness Previous: Calculating the relative hardness

Clustering the datasets

In order to evaluate the relative hardness formula used in the experiments, we carried out an unsupervised clustering of all the documents of each subcorpus obtained for each dataset. We chose the MajorClust clustering algorithm [7] because it takes into account both the inside and the outside similarities among the clusters obtained during its execution. In order to keep the validation independent of RH, we used the tf-idf formula to calculate the input similarity matrix for MajorClust. Each evaluation was performed with the $ F$-measure, which is calculated as follows: given a set of clusters $ \{G_1,\ldots,G_m\}$ and a set of classes $ \{C_1,\ldots,C_n\}$, the $ F$-measure between a cluster $ i$ and a class $ j$ is given by the following formula.

$\displaystyle F_{ij}=\frac{2\cdot P_{ij}\cdot R_{ij}}{P_{ij}+R_{ij}},$ (3)

where $ 1\le i\le m$ and $ 1\le j\le n$. The precision $ P_{ij}$ and the recall $ R_{ij}$ are defined as follows:

$\displaystyle P_{ij}=\frac{\mbox{Number of texts from cluster }i\mbox{ in class }j} {\mbox{Number of texts from cluster }i},$ (4)

and

$\displaystyle R_{ij}=\frac{\mbox{Number of texts from cluster }i\mbox{ in class }j} {\mbox{Number of texts in class }j}.$ (5)

The global performance of a clustering method is calculated from the values of $ F_{ij}$: the best value obtained by each cluster is weighted by the cardinality of that cluster and normalised by the total number of documents in the collection ($ \vert D\vert$). The resulting measure is named the $ F$-measure and is shown in Equation (6).

$\displaystyle F=\sum_{1\le i\le m}\frac{\vert G_i\vert}{\vert D\vert}\max_{1\le j\le n}F_{ij}.$ (6)
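Equations (3) to (6) can be illustrated with a short sketch in Python. It is not the code used in the experiments; the function name `f_measure` and the representation of the input as two parallel lists of labels (one cluster label and one class label per document) are assumptions made only for this illustration.

```python
from collections import Counter


def f_measure(clusters, classes):
    """Global F-measure of Equation (6).

    clusters[k] is the cluster label of document k and classes[k] its
    class label (hypothetical input format, chosen for this sketch).
    """
    if len(clusters) != len(classes) or not clusters:
        raise ValueError("expected two non-empty lists of equal length")

    n_docs = len(clusters)                     # |D|
    cluster_sizes = Counter(clusters)          # |G_i|
    class_sizes = Counter(classes)             # |C_j|
    overlap = Counter(zip(clusters, classes))  # texts of cluster i in class j

    total = 0.0
    for g, g_size in cluster_sizes.items():
        best = 0.0
        for c, c_size in class_sizes.items():
            n_gc = overlap[(g, c)]
            if n_gc == 0:
                continue  # P and R are both zero, so F_ij is taken as 0
            p = n_gc / g_size                         # Equation (4)
            r = n_gc / c_size                         # Equation (5)
            best = max(best, 2 * p * r / (p + r))     # Equation (3)
        total += g_size / n_docs * best               # Equation (6)
    return total


# Toy example: six documents, two clusters, two classes.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_classes = ["a", "a", "b", "b", "b", "b"]
toy_score = f_measure(toy_clusters, toy_classes)
```

In the toy example, cluster 0 is best matched by class `a` ($ F=0.8$) and cluster 1 by class `b` ($ F=6/7$), so the global value is $ 0.5\cdot 0.8+0.5\cdot 6/7\approx 0.829$. A clustering that reproduces the classes exactly yields $ F=1$.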


David Pinto 2007-10-05