
Data Set

In our experiments we used two corpora based on the hep-ex collection of abstracts compiled and provided to us by the University of Jaén, Spain [11]. The first corpus was built by extracting a subset of documents from the full collection. The second corpus is the full collection itself, which consists of 2,922 abstracts from the Physics domain originally stored at CERN[*]. The features of both corpora are shown in Table 1, and the distribution of categories in each corpus is given in Table 2.


Table 1: Features of the preprocessed collections

    Feature                        Subset of hep-ex   Full hep-ex collection
    Size of the corpus (bytes)              165,349                  962,802
    Number of categories                          7                        9
    Number of abstracts                         500                    2,922
    Total number of terms                    23,500                  135,969
    Vocabulary size (terms)                   2,430                    6,150
    Average terms per abstract                   47                    46.53
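
The corpus-level features in Table 1 follow directly from the token streams of the preprocessed abstracts. The following is a minimal sketch of how they can be derived, assuming each abstract has already been reduced to a list of terms; the function name corpus_features is hypothetical, not part of the original work.

    def corpus_features(corpus: list[list[str]]) -> dict:
        """Compute Table 1 style statistics for a preprocessed corpus,
        given as a list of abstracts, each a list of term strings."""
        terms = [t for abstract in corpus for t in abstract]
        return {
            "number of abstracts": len(corpus),
            "total number of terms": len(terms),
            "vocabulary size (terms)": len(set(terms)),
            "average terms per abstract": len(terms) / len(corpus),
        }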


Table 2: Categories in the corpora

    Category                                  Number of texts   Subset of hep-ex   Full collection
    Information Transfer and Management                     1                 NO               YES
    Particle Physics - Phenomenology                        3                YES               YES
    Particle Physics - Experimental Results             2,623                YES               YES
    XX                                                      1                YES               YES
    Nonlinear Systems                                       1                YES               YES
    Accelerators and Storage Rings                         18                YES               YES
    Astrophysics and Astronomy                              3                YES               YES
    Other Fields of Physics                                 1                 NO               YES
    Detectors and Experimental Techniques                 271                YES               YES

We preprocessed both collections by removing stopwords and applying the Porter stemmer. Given the short average abstract length (approximately 47 terms), the preprocessed collections are well suited to our experiments. The preprocessed corpora, the stopword list, and the stemmer can be downloaded from the project site[*].
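
The preprocessing step just described can be reproduced along the following lines. This is a minimal sketch in Python that uses NLTK's English stopword list and Porter stemmer implementation as stand-ins for the project's own downloadable resources; the helper name preprocess is hypothetical.

    # Minimal sketch of the preprocessing step: lowercase, tokenize,
    # remove stopwords, and apply the Porter stemmer.
    # NLTK's resources stand in for the project's own stopword list and
    # stemmer; run nltk.download('stopwords') once beforehand.
    import re

    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer

    STOPWORDS = set(stopwords.words("english"))
    STEMMER = PorterStemmer()

    def preprocess(abstract: str) -> list[str]:
        """Return the list of stemmed content terms of one abstract."""
        tokens = re.findall(r"[a-z]+", abstract.lower())
        return [STEMMER.stem(t) for t in tokens if t not in STOPWORDS]

    print(preprocess("Measurements of the top quark mass at hadron colliders"))
    # -> ['measur', 'top', 'quark', 'mass', 'hadron', 'collid']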

