
The hep-ex corpus

This corpus, named hep-ex, is based on the collection of abstracts compiled by the University of Jaén, Spain [Montejo-Ráez, Ureña-López, and Steinberger, 2005]. It is composed of 2,922 abstracts from the Physics domain, originally stored in the data server of the "Conseil Européen pour la Recherche Nucléaire" (CERN).

The distribution of the categories of this corpus is given in Table 3, and other characteristics are shown in Table 4. As can be seen, the corpus is highly unbalanced: about 90% of the abstracts belong to a single category, while five of the nine categories contain a single abstract each. This makes the task even more challenging.

Table 3: Categories of hep-ex

Category                                    # of texts
Particle physics (experimental results)          2,623
Detectors and experimental techniques              271
Accelerators and storage rings                      18
Particle physics (phenomenology)                     3
Astrophysics and astronomy                           3
Information transfer and management                  1
Nonlinear systems                                    1
Other fields of physics                              1
XX                                                   1
Total                                            2,922
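
To make the imbalance concrete, the following minimal sketch computes the share of each category from the counts in the table above. The variable and function names are illustrative assumptions, not part of the original work.

```python
# Category counts copied from the table above.
category_counts = {
    "Particle physics (experimental results)": 2623,
    "Detectors and experimental techniques": 271,
    "Accelerators and storage rings": 18,
    "Particle physics (phenomenology)": 3,
    "Astrophysics and astronomy": 3,
    "Information transfer and management": 1,
    "Nonlinear systems": 1,
    "Other fields of physics": 1,
    "XX": 1,
}


def category_shares(counts):
    """Return each category's fraction of the whole corpus."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = category_shares(category_counts)

# Print the categories from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:42s} {share:7.2%}")
```

The two largest categories together cover roughly 99% of the abstracts, so the remaining seven categories are almost empty.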

Table 4: Other features of hep-ex

Feature                              Value
Size of the corpus (bytes)         962,802
Number of categories                     9
Number of abstracts                  2,922
Total number of terms              135,969
Vocabulary size (terms)              6,150
Average number of terms per abstract 46.53
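
As a quick consistency check, the average number of terms per abstract can be recomputed from the total number of terms and the number of abstracts reported above. This is only a sketch; the variable names are illustrative.

```python
# Values copied from the table of corpus features above.
n_abstracts = 2922
total_terms = 135969

# Average number of terms per abstract, rounded to two decimals.
avg_terms_per_abstract = total_terms / n_abstracts
print(f"{avg_terms_per_abstract:.2f}")  # prints 46.53
```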

We preprocessed both collections by removing stopwords and applying the Porter stemmer. Given the short average length of their abstracts, the preprocessed collections are suitable for our experiments.
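
The preprocessing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact pipeline used for the corpus: the tokenization rule, the tiny stopword list, and the function name `preprocess` are all hypothetical, and the stemmer is passed in as a function. If the third-party NLTK package is installed, its `PorterStemmer` offers one available implementation of the Porter algorithm; otherwise the demo falls back to no stemming.

```python
import re

# Tiny illustrative stopword list (an assumption; the source does not
# state which stopword list was used).
STOPWORDS = {"a", "an", "and", "are", "at", "in", "is", "of", "the", "to", "with"}


def preprocess(text, stem, stopwords=STOPWORDS):
    """Lowercase, tokenize, drop stopwords, then stem the remaining tokens."""
    # Assumed tokenization: maximal runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(tok) for tok in tokens if tok not in stopwords]


if __name__ == "__main__":
    try:
        # Optional third-party dependency providing a Porter stemmer.
        from nltk.stem import PorterStemmer
        stem = PorterStemmer().stem
    except ImportError:
        # Fallback when NLTK is unavailable: leave tokens unstemmed.
        stem = lambda tok: tok
    print(preprocess("Measurements of the detectors at CERN", stem))
```

Passing the stemmer as an argument keeps the sketch runnable without NLTK and makes it easy to swap in another Porter implementation.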

David Pinto 2006-05-25