

Datasets

The preliminary experiments were carried out using three corpora: the training and test splits of the R8 version of the Reuters collection and, partially, a reduced version of the 20 Newsgroups collection named "Mini20Newsgroups". We pre-processed each corpus by removing punctuation symbols and stopwords and then applying the Porter stemmer. The characteristics of each corpus after pre-processing are given in Table 1.


Table 1: Characteristics of Reuters-R8 and Mini20Newsgroups

             R8-Train         R8-Test        Mini20Newsgroups
Size         ≈2,500 KBytes    ≈900 KBytes    ≈1,900 KBytes
Documents    5,839            2,319          2,000
Categories   8                8              20
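
The pre-processing steps described above (punctuation removal, stopword removal, stemming) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used in the experiments: the function name preprocess, the small stopword list, and the lowercasing step are all hypothetical, since the text does not specify them. The stemmer is passed in as a callable and defaults to the identity function, so the sketch runs without extra dependencies; a Porter stemmer such as NLTK's PorterStemmer().stem could be supplied in its place.

```python
import string

# Illustrative stopword list (an assumption; the actual list is not given in the text).
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"})

# Map every ASCII punctuation character to a space so adjacent words stay separated.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def preprocess(text, stopwords=STOPWORDS, stem=lambda token: token):
    """Return the list of pre-processed tokens for one document.

    Steps: lowercase (an assumption), remove punctuation, drop stopwords,
    then apply `stem` to each remaining token. The default `stem` is the
    identity; pass a Porter stemmer callable to reproduce the last step.
    """
    tokens = text.lower().translate(_PUNCT_TO_SPACE).split()
    return [stem(token) for token in tokens if token not in stopwords]
```

For example, calling preprocess on one document string returns its remaining tokens in order; if NLTK is installed, preprocess(text, stem=PorterStemmer().stem) would apply Porter stemming as the final step.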



David Pinto 2007-10-05