Next: Conclusions Up: On the Relative Hardness Previous: Clustering the datasets


Correlation between relative hardness and F-Measure

Our preliminary experiments were carried out on the training and test versions of the Reuters R8 collection and, partially, on a reduced version of 20 Newsgroups. Figure 1 shows the apparent correlation between the relative hardness (RH) of the (i) training and (ii) test versions of the R8 collection and the F-Measure obtained with the MajorClust clustering algorithm. For both corpora, the smaller the RH value (x-axis), the higher the F-Measure (y-axis), and vice versa. The RH vs. F-Measure correlation was calculated for all possible corpus variants of R8 (247). To make the correlation between RH and F-Measure easier to visualise, we have plotted the polynomial approximation of degree one.
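As a rough illustration of the two quantities mentioned above, the following sketch counts the category subsets of an eight-category collection and fits a polynomial of degree one by ordinary least squares. The count of 247 is consistent with taking every subset that holds at least two of the eight categories. The function names `count_subcorpora` and `fit_line` are hypothetical, the use of ordinary least squares is an assumption about how the degree-one approximation was obtained, and the points in the usage part are placeholders, not values from the experiments.

```python
from math import comb


def count_subcorpora(n_categories, min_size=2):
    """Number of category subsets holding at least `min_size` categories."""
    return sum(comb(n_categories, k) for k in range(min_size, n_categories + 1))


def fit_line(xs, ys):
    """Least-squares polynomial of degree one: returns (slope, intercept)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x and sum of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Subsets of the 8 R8 categories with at least two members.
    print(count_subcorpora(8))
    # Placeholder points, NOT measurements from the paper.
    toy_rh = [0.10, 0.20, 0.30, 0.40]
    toy_f = [0.90, 0.80, 0.70, 0.60]
    slope, intercept = fit_line(toy_rh, toy_f)
    print(slope, intercept)
```

A negative slope in such a fit corresponds to the trend described above: lower RH goes with higher F-Measure.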

Figure 1: Evaluation of all R8 subcorpora (more than two categories per corpus). (a) Train; (b) Test.

Figure 2 shows the apparent correlation between the relative hardness of each pair of categories of the R8 collection and the F-Measure obtained, again, with the MajorClust clustering algorithm. The same conclusion holds: the smaller the RH value (x-axis), the higher the F-Measure (y-axis), and vice versa.
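One way to put a number on such a trend is the Pearson correlation coefficient, sketched below. This is not a measure reported in the text; it is offered only as an assumption about how the strength of the relation could be quantified. The function name `pearson` is hypothetical and the sample points are placeholders. Eight categories yield `comb(8, 2)` = 28 distinct pairs, so a per-pair analysis of R8 would involve 28 (RH, F-Measure) points.

```python
from math import comb, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Number of category pairs for an eight-category collection.
    print(comb(8, 2))
    # Placeholder points, NOT measurements from the paper.
    toy_rh = [0.15, 0.20, 0.30, 0.40]
    toy_f = [0.95, 0.85, 0.80, 0.60]
    print(pearson(toy_rh, toy_f))
```

A coefficient close to -1 would match the inverse relation described above; a value near 0 would indicate no linear relation.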

Figure 2: Evaluation of single pairs of the R8 categories. (a) Train; (b) Test.

To illustrate the behaviour of the RH formula, Tables 2 and 3 present the most and least related pairs of categories of the R8 dataset, respectively. The RH value of each pair was calculated with the formula presented in Section 3. Some preliminary experiments were also carried out with the Mini20Newsgroups dataset; its most and least related pairs of categories are shown in Tables 4 and 5, respectively.


Table 2: The most related categories of the R8 collection

(a) Train
RH value   Category   Category
0.426      trade      money-fx
0.399      money-fx   interest
0.367      trade      crude
0.362      money-fx   crude
0.352      trade      interest

(b) Test
RH value   Category   Category
0.419      money-fx   interest
0.364      trade      money-fx
0.332      trade      interest
0.317      trade      crude
0.311      money-fx   crude


Table 3: The least related categories of the R8 collection

(a) Train
RH value   Category   Category
0.188      interest   earn
0.180      acq        ship
0.173      ship       earn
0.153      grain      acq
0.147      grain      earn

(b) Test
RH value   Category   Category
0.186      interest   acq
0.154      ship       earn
0.147      acq        ship
0.128      grain      earn
0.111      grain      acq
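The values in Table 3 can also be used to check how stable the ranking is between the two versions of the collection. The sketch below transcribes the Table 3 values into two dictionaries, sorts the pairs by RH, and lists the pairs that appear in both panels. The names `TRAIN`, `TEST`, `rank_pairs` and `shared_pairs` are hypothetical, and this comparison is an illustration rather than an analysis carried out in the text.

```python
# RH values transcribed from Table 3 (least related R8 pairs).
# Pairs are stored as frozensets because the order of the two
# categories in a pair carries no meaning.
TRAIN = {
    frozenset({"interest", "earn"}): 0.188,
    frozenset({"acq", "ship"}): 0.180,
    frozenset({"ship", "earn"}): 0.173,
    frozenset({"grain", "acq"}): 0.153,
    frozenset({"grain", "earn"}): 0.147,
}
TEST = {
    frozenset({"interest", "acq"}): 0.186,
    frozenset({"ship", "earn"}): 0.154,
    frozenset({"acq", "ship"}): 0.147,
    frozenset({"grain", "earn"}): 0.128,
    frozenset({"grain", "acq"}): 0.111,
}


def rank_pairs(rh_by_pair):
    """Pairs sorted from lowest to highest RH (least related first)."""
    return sorted(rh_by_pair, key=rh_by_pair.get)


def shared_pairs(first, second):
    """Pairs present in both rankings."""
    return set(first) & set(second)


if __name__ == "__main__":
    for name, table in (("train", TRAIN), ("test", TEST)):
        lowest = rank_pairs(table)[0]
        print(name, sorted(lowest), table[lowest])
    for pair in sorted(sorted(p) for p in shared_pairs(TRAIN, TEST)):
        print("in both panels:", pair)
```

Four of the five pairs occur in both panels; only the pair involving interest differs (interest/earn in the training version, interest/acq in the test version).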


Table 4: The most related categories of the Mini20Newsgroups collection
RH value   Category                   Category
0.3412     talk politics guns         talk politics misc
0.3170     alt atheism                talk religion misc
0.3092     talk politics guns         talk religion misc
0.3052     talk politics misc         talk religion misc
0.3041     soc religion christian     talk religion misc
0.2988     sci crypt                  talk politics guns
0.2985     soc religion christian     talk politics misc
0.2958     soc religion christian     talk politics guns
0.2932     talk politics mideast      talk politics misc
0.2905     sci electronics            sci space
0.2868     comp sys ibm pc hardware   comp sys mac hardware


Table 5: The least related categories of the Mini20Newsgroups collection
RH value   Category                   Category
0.1814     comp os mswindows misc     rec sport hockey
0.1807     misc forsale               talk politics misc
0.1804     misc forsale               talk religion misc
0.1803     comp sys ibm pc hardware   talk politics mideast
0.1798     comp os mswindows misc     talk religion misc
0.1789     alt atheism                comp os mswindows misc
0.1767     alt atheism                misc forsale
0.1751     misc forsale               soc religion christian
0.1737     comp os mswindows misc     soc religion christian
0.1697     misc forsale               talk politics mideast
0.1670     comp os mswindows misc     talk politics mideast


David Pinto 2007-10-05