Figure 2 shows the possible correlation between the relative hardness (RH) of each pair of categories of the R8 collection and the F-Measure obtained, again using the MajorClust clustering algorithm. The same conclusion holds: the smaller the value of RH (x-axis), the higher the F-Measure obtained (y-axis), and vice versa.
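One way to quantify such an inverse relation is a correlation coefficient computed over the (RH, F-Measure) points. The following is only a minimal sketch of that idea, assuming Pearson correlation as the measure (the text does not say which measure, if any, was used). The function name `pearson` and the numeric values are placeholders for illustration, not results from the experiments.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Placeholder values, NOT taken from the paper: one entry per category pair.
rh_values = [0.10, 0.20, 0.30, 0.40]
f_measures = [0.90, 0.80, 0.60, 0.50]

r = pearson(rh_values, f_measures)
print(f"Pearson r = {r:.3f}")
```

A clearly negative coefficient would be consistent with the reading that lower RH values go together with higher F-Measure values.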
In order to fully appreciate the RH formula, the most and least related pairs of categories of the R8 dataset are presented in Tables 2 and 3, respectively. The RH value associated with each pair was calculated with the formula presented in Section 3. Some preliminary experiments were also carried out with the Mini20Newsgroups dataset; the most and least related pairs of categories are shown in Tables 4 and 5, respectively.
[Figure 2 panels: (a) Train, (b) Test]
Table 4: Most related pairs of categories of the Mini20Newsgroups dataset
RH value | Category 1 | Category 2 |
0.3412 | talk politics guns | talk politics misc |
0.3170 | alt atheism | talk religion misc |
0.3092 | talk politics guns | talk religion misc |
0.3052 | talk politics misc | talk religion misc |
0.3041 | soc religion christian | talk religion misc |
0.2988 | sci crypt | talk politics guns |
0.2985 | soc religion christian | talk politics misc |
0.2958 | soc religion christian | talk politics guns |
0.2932 | talk politics mideast | talk politics misc |
0.2905 | sci electronics | sci space |
0.2868 | comp sys ibm pc hardware | comp sys mac hardware |
Table 5: Least related pairs of categories of the Mini20Newsgroups dataset
RH value | Category 1 | Category 2 |
0.1814 | comp os mswindows misc | rec sport hockey |
0.1807 | misc forsale | talk politics misc |
0.1804 | misc forsale | talk religion misc |
0.1803 | comp sys ibm pc hardware | talk politics mideast |
0.1798 | comp os mswindows misc | talk religion misc |
0.1789 | alt atheism | comp os mswindows misc |
0.1767 | alt atheism | misc forsale |
0.1751 | misc forsale | soc religion christian |
0.1737 | comp os mswindows misc | soc religion christian |
0.1697 | misc forsale | talk politics mideast |
0.1670 | comp os mswindows misc | talk politics mideast |
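
The selection of the most and least related pairs can be sketched as a simple ranking of category pairs by their RH value. The snippet below is an illustrative sketch only: it does not compute RH (the formula of Section 3 is not reproduced here) but takes a few rows copied from the tables above as input. The helper name `most_and_least_related` and the parameter `k` are assumptions introduced for this example.

```python
# A few (RH value, category, category) rows copied from the tables above.
rh_pairs = [
    (0.3412, "talk politics guns", "talk politics misc"),
    (0.1670, "comp os mswindows misc", "talk politics mideast"),
    (0.3170, "alt atheism", "talk religion misc"),
    (0.1697, "misc forsale", "talk politics mideast"),
    (0.2868, "comp sys ibm pc hardware", "comp sys mac hardware"),
    (0.1814, "comp os mswindows misc", "rec sport hockey"),
]


def most_and_least_related(pairs, k):
    """Return the k pairs with the highest RH and the k with the lowest RH.

    Hypothetical helper: both lists are ordered by decreasing RH value,
    matching the ordering used in the tables.
    """
    if k < 1 or k > len(pairs):
        raise ValueError("k must be between 1 and the number of pairs")
    ranked = sorted(pairs, key=lambda row: row[0], reverse=True)
    return ranked[:k], ranked[-k:]


most, least = most_and_least_related(rh_pairs, 2)
for rh, cat_a, cat_b in most:
    print(f"most related:  {rh:.4f} | {cat_a} | {cat_b}")
for rh, cat_a, cat_b in least:
    print(f"least related: {rh:.4f} | {cat_a} | {cat_b}")
```

With the full list of pairwise RH values as input, the same ranking would yield tables of the kind shown above.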