The hep-ex corpus


This corpus, named hep-ex, is based on the collection of abstracts compiled by the University of Jaén, Spain [Montejo-Ráez, Ureña-López, and Steinberger 2005]. It comprises 2,922 abstracts from the Physics domain, originally stored on the data server of the "Conseil Européen pour la Recherche Nucléaire" (CERN)².

The distribution of categories in this corpus is given in Table 3, and its other characteristics are shown in Table 4. As can be seen, the corpus is highly unbalanced: 2,623 of the 2,922 abstracts (about 90%) belong to a single category, which makes the task even more challenging.

**Table 3:** Categories of *hep-ex*
Category	# of texts
Particle physics (experimental results)	2,623
Detectors and experimental techniques	271
Accelerators and storage rings	18
Particle physics (phenomenology)	3
Astrophysics and astronomy	3
Information transfer and management	1
Nonlinear systems	1
Other fields of physics	1
XX	1
Total	2,922
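
To make the degree of imbalance concrete, the category counts in the table above can be turned into relative shares. The following is only a minimal sketch: the counts are copied from the table, while the variable names and the max/min ratio used as an imbalance indicator are illustrative choices, not something taken from the original experiments.

```python
# Category counts copied from the table of hep-ex categories.
counts = {
    "Particle physics (experimental results)": 2623,
    "Detectors and experimental techniques": 271,
    "Accelerators and storage rings": 18,
    "Particle physics (phenomenology)": 3,
    "Astrophysics and astronomy": 3,
    "Information transfer and management": 1,
    "Nonlinear systems": 1,
    "Other fields of physics": 1,
    "XX": 1,
}

total = sum(counts.values())

# Relative share of each category in the corpus.
shares = {name: n / total for name, n in counts.items()}

# Largest category and a simple (assumed) imbalance indicator:
# size of the largest category divided by the size of the smallest.
majority = max(counts, key=counts.get)
imbalance_ratio = max(counts.values()) / min(counts.values())

for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

Four of the nine categories contain a single abstract each, so any clustering evaluated against these labels is dominated by the largest category.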

**Table 4:** Other features of *hep-ex*
Feature	Value
Size of the corpus (bytes)	962,802
Number of categories	9
Number of abstracts	2,922
Total number of terms	135,969
Vocabulary size (terms)	6,150
Average terms per abstract	46.53
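
As an illustration of how the term-level features in this table relate to one another, the sketch below computes the total number of terms, the vocabulary size, and the average number of terms per abstract for a list of texts. It assumes lowercase whitespace tokenization, which may differ from the tokenization actually used; the function name `corpus_stats` and the two sample abstracts are hypothetical.

```python
def corpus_stats(abstracts):
    """Return (total terms, vocabulary size, mean terms per abstract)."""
    # Assumption: terms are lowercase, whitespace-separated tokens.
    tokenized = [text.lower().split() for text in abstracts]
    total_terms = sum(len(tokens) for tokens in tokenized)
    vocabulary = {term for tokens in tokenized for term in tokens}
    average = total_terms / len(tokenized) if tokenized else 0.0
    return total_terms, len(vocabulary), average


# Hypothetical sample texts, not taken from the corpus.
sample = ["measurement of the top quark mass", "search for the higgs boson"]
print(corpus_stats(sample))

# Consistency check on the reported figures: terms / abstracts.
reported_average = 135969 / 2922
print(round(reported_average, 2))
```

The last two lines simply confirm that the reported average is the total number of terms divided by the number of abstracts.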

We preprocessed these collections by removing stopwords and applying the Porter stemmer. Given their small average size per abstract, the preprocessed collections are suitable for our experiments.
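
The preprocessing described above can be sketched as a two-step pipeline: stopword removal followed by stemming. This is a hedged illustration rather than the code used for the experiments: the stopword list is a tiny hypothetical one, the tokenization rule is an assumption, and the stemmer is passed in as a callable (identity by default) so that the block runs without external libraries.

```python
import re

# Tiny illustrative stopword list; hypothetical, not the list used originally.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to", "with"}


def preprocess(text, stem=lambda term: term, stopwords=STOPWORDS):
    """Lowercase, tokenize, drop stopwords, then stem each remaining term."""
    # Assumption: a term is a maximal run of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in stopwords]


print(preprocess("Search for the Higgs boson in the CMS detector"))
```

To apply actual Porter stemming, one option is to pass the `stem` method of NLTK's `PorterStemmer` (if NLTK is installed) as the `stem` argument, for example `preprocess(text, stem=PorterStemmer().stem)`.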


David Pinto 2006-05-25