Next: Clustering the datasets Up: On the Relative Hardness Previous: Datasets


Calculating the relative hardness of a corpus

To determine the Relative Hardness (RH) of a given corpus, we consider the vocabulary overlap among the texts of the corpus. In our experiments, we used the well-known Jaccard coefficient to measure this overlap. We considered all possible combinations of two or more categories from the corpus, and for each of them we calculated its RH. For a corpus of $ n$ categories this yields $ 2^n - (n+1)$ possible subcorpora, since the empty set and the $ n$ single-category subsets are excluded: e.g. for R8 (eight categories) we obtained $ 2^8 - 9 = 247$ subsets.
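The subset count above can be checked by direct enumeration. The following is a minimal sketch, not the authors' code: the function name `subcorpora` and the placeholder labels `cat1` ... `cat8` are assumptions introduced here for illustration and do not correspond to the actual R8 category names.

```python
from itertools import combinations


def subcorpora(categories):
    """Yield every subset of `categories` that contains at least two categories."""
    for size in range(2, len(categories) + 1):
        yield from combinations(categories, size)


# Placeholder labels standing in for the eight categories of R8 (hypothetical names).
labels = [f"cat{i}" for i in range(1, 9)]
subsets = list(subcorpora(labels))

n = len(labels)
# Both numbers should agree: enumerated subsets vs. the closed form 2^n - (n + 1).
print(len(subsets), 2**n - (n + 1))
```

Because the number of subcorpora grows exponentially with $ n$, a generator is used here so that the subsets can be processed one at a time instead of being held in memory.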

Thereafter, we calculated their RHs as follows: given a corpus $ C_i$ made up of $ n$ categories (CAT), the RH of $ C_i=\{CAT_1, CAT_2, ..., CAT_n\}$ is:

$\displaystyle RH(C_i) = \frac{1}{n(n-1)/2} \times \sum_{j,k=1; j<k}^{n}{Similarity(CAT_j, CAT_k)},$ (1)

where the similarity between two categories is their vocabulary overlap, measured with the Jaccard coefficient (see Eq. (2)). More sophisticated measures could also be used, such as the one presented in [8] in the framework of plagiarism degree calculation.

$\displaystyle Similarity(CAT_j, CAT_k) = \frac{\vert CAT_j \cap CAT_k\vert}{\vert CAT_j \cup CAT_k\vert}$ (2)

In the above formula, each category $ j$ is treated as the single ``document'' obtained by concatenating all the documents belonging to that category, so that $ CAT_j$ stands for the vocabulary of this concatenated document.
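Eqs. (1) and (2) can be sketched in a few lines of code. This is an illustrative sketch under stated assumptions rather than the implementation used in the experiments: the tokenization (lower-casing and splitting on whitespace) is an assumption, since the preprocessing is not specified here, and the names `vocabulary`, `jaccard`, `relative_hardness` and the toy corpus are hypothetical.

```python
from itertools import combinations


def vocabulary(documents):
    """Vocabulary of a category: the set of distinct tokens of its concatenated documents.

    Assumption: tokens are lower-cased, whitespace-separated strings.
    """
    return set(" ".join(documents).lower().split())


def jaccard(a, b):
    """Eq. (2): size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relative_hardness(corpus):
    """Eq. (1): mean Jaccard similarity over all n(n-1)/2 pairs of categories.

    `corpus` maps each category name to the list of its documents.
    """
    vocabs = [vocabulary(docs) for docs in corpus.values()]
    pairs = list(combinations(vocabs, 2))
    if not pairs:
        raise ValueError("at least two categories are required")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy corpus with three categories (invented data, for illustration only).
toy = {
    "grain": ["wheat corn price", "corn export"],
    "trade": ["export price tariff"],
    "ship": ["port vessel"],
}
rh = relative_hardness(toy)
print(round(rh, 4))
```

In the toy corpus only the pair (grain, trade) shares vocabulary (2 common terms out of 5), so the RH is $ (2/5 + 0 + 0)/3$. A higher RH means the categories share more vocabulary.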


David Pinto 2007-10-05