Datasets
The preliminary experiments were carried out on three corpora: the training and test sets of the R8 version of the Reuters collection and, partially, a reduced version of 20 Newsgroups named "Mini20Newsgroups". We pre-processed each corpus by removing punctuation symbols and stopwords and then applying the Porter stemmer. The characteristics of each corpus after pre-processing are given in Table 1.
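The pre-processing steps described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used in the experiments: the stopword list is a tiny placeholder, lowercasing and whitespace tokenization are our own choices, and the stemmer is passed in as a callable (the hypothetical `stem` parameter) so that any Porter stemmer implementation can be plugged in.

```python
import string
from typing import Callable, Iterable, List

# Placeholder stopword list (assumption: the source does not say which list was used).
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "in", "to", "is", "are", "by", "for", "on"})

# Translation table mapping every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def preprocess(text: str,
               stem: Callable[[str], str] = lambda token: token,
               stopwords: Iterable[str] = STOPWORDS) -> List[str]:
    """Remove punctuation and stopwords, then stem the remaining tokens.

    `stem` defaults to the identity function; pass a real Porter stemmer
    to reproduce the last step of the pipeline.
    """
    stop = set(stopwords)
    # Step 1: eliminate punctuation symbols.
    cleaned = text.translate(_PUNCT_TABLE)
    # Step 2: lowercase, split on whitespace, and drop stopwords
    # (lowercasing is an assumption, not stated in the source).
    tokens = [tok for tok in cleaned.lower().split() if tok not in stop]
    # Step 3: apply the stemmer to each surviving token.
    return [stem(tok) for tok in tokens]
```

With the NLTK package installed, for instance, `PorterStemmer().stem` from `nltk.stem` could be passed as `stem`. The order of the steps follows the text: stopwords are removed before stemming, so the stopword list is matched against unstemmed tokens.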
Table 1: Characteristics of Reuters-R8 and Mini20Newsgroups

|            | R8-Train     | R8-Test    | Mini20Newsgroups |
|------------|--------------|------------|------------------|
| Size       | 2,500 KBytes | 900 KBytes | 1,900 KBytes     |
| Documents  | 5,839        | 2,319      | 2,000            |
| Categories | 8            | 8          | 20               |
David Pinto
2007-10-05