
Calculating the relative hardness of a corpus

In order to determine the Relative Hardness (RH) of a given corpus, we consider the vocabulary overlap among the texts of the corpus. In our experiments, we used the well-known Jaccard coefficient to measure this overlap. We considered all the possible combinations of more than two categories from the corpus, and for each of them we calculated its RH. For instance, a corpus of $n$ categories yields one subcorpus per such combination; e.g. the subsets of R8 (eight categories) were obtained in this way.
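The enumeration of subcorpora described above can be sketched as follows. This is only an illustrative sketch, not the code used in the experiments: the function name `subcorpora`, the parameter `min_size`, and the placeholder category labels are assumptions. The default `min_size=3` follows a literal reading of "more than two categories"; passing `min_size=2` would also include the pairs of categories.

```python
from itertools import combinations


def subcorpora(categories, min_size=3):
    """Yield every combination of at least `min_size` categories.

    `min_size=3` is an assumption based on the phrase
    "more than two categories"; adjust it if pairs should be included.
    """
    for size in range(min_size, len(categories) + 1):
        yield from combinations(categories, size)


# Hypothetical labels standing in for the eight categories of R8.
labels = [f"cat{i}" for i in range(1, 9)]
subsets = list(subcorpora(labels))
print(len(subsets))  # number of subcorpora for the chosen min_size
```

Each tuple produced by `subcorpora` is one candidate subcorpus for which an RH value can then be calculated.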

Thereafter, we calculated their RHs as follows: given a corpus $C_i$ made up of $n$ categories (CAT), $C_i=\{CAT_1, CAT_2, \ldots, CAT_n\}$, its RH is:

$\displaystyle RH(C_i) = \frac{1}{n(n-1)/2} \times \sum_{j,k=1;\, j<k}^{n} Similarity(CAT_j, CAT_k),$ (1)

where the similarity between two categories is their vocabulary overlap, measured with the Jaccard coefficient (see Eq. (2)). More sophisticated measures could also be used, such as the one presented in [8] for calculating the degree of plagiarism.

$\displaystyle Similarity(CAT_j, CAT_k) = \frac{\vert CAT_j \cap CAT_k\vert}{\vert CAT_j \cup CAT_k\vert}$ (2)

In the above formula, each category is treated as the "document" obtained by concatenating all the documents belonging to that category, so that $CAT_j$ stands for the vocabulary of that concatenated document.
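To make Eqs. (1) and (2) concrete, the following minimal sketch computes the RH of a small toy corpus. It is not the implementation used in the experiments: the function names, the whitespace tokenisation, the lower-casing, and the toy documents are all assumptions made for illustration.

```python
from itertools import combinations


def vocabulary(documents):
    """Vocabulary of a category: the set of terms found in the
    concatenation of all its documents."""
    # Assumption: terms are lower-cased, whitespace-separated tokens.
    return {term for doc in documents for term in doc.lower().split()}


def jaccard(a, b):
    """Eq. (2): size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relative_hardness(corpus):
    """Eq. (1): mean Jaccard similarity over all n(n-1)/2 category pairs.

    `corpus` maps a category name to the list of its documents.
    """
    vocabs = [vocabulary(docs) for docs in corpus.values()]
    pairs = list(combinations(vocabs, 2))
    if not pairs:
        raise ValueError("RH needs at least two categories")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy corpus with three hypothetical categories.
toy = {
    "A": ["wheat corn", "corn price"],
    "B": ["corn oil"],
    "C": ["oil price"],
}
rh = relative_hardness(toy)
print(round(rh, 3))  # (1/4 + 1/4 + 1/3) / 3 = 5/18, about 0.278
```

In this toy case the pairwise similarities are 1/4, 1/4 and 1/3, so the RH is their mean. A higher RH means that the categories share more vocabulary.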


David Pinto 2007-10-05