

The hep-ex corpus of CERN

This corpus is based on the collection of abstracts compiled by the University of Jaén, Spain [16], named hep-ex. It is composed of 2,922 abstracts from the Physics domain, originally stored on the CERN data server.

The distribution of abstracts across the categories of this corpus is given in Table 3; other characteristics are shown in Table 4. As can be seen, the corpus is highly unbalanced: 2,623 of the 2,922 abstracts (about 90%) belong to a single category, which makes the task even more challenging.


Table 3: Categories of the hep-ex corpus
Category                                   Number of abstracts
Particle physics (experimental results)    2,623
Detectors and experimental techniques      271
Accelerators and storage rings             18
Particle physics (phenomenology)           3
Astrophysics and astronomy                 3
Information transfer and management        1
Nonlinear systems                          1
Other fields of physics                    1
XX                                         1
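
The imbalance in Table 3 can be quantified with a short calculation. The following is a minimal sketch in Python. The counts are copied from the table, while the variable names (`category_counts`, `shares`, `largest`) are illustrative assumptions and not part of the original corpus tooling.

```python
# Abstract counts per category, copied from Table 3.
category_counts = {
    "Particle physics (experimental results)": 2623,
    "Detectors and experimental techniques": 271,
    "Accelerators and storage rings": 18,
    "Particle physics (phenomenology)": 3,
    "Astrophysics and astronomy": 3,
    "Information transfer and management": 1,
    "Nonlinear systems": 1,
    "Other fields of physics": 1,
    "XX": 1,
}

# Total number of abstracts in the corpus.
total = sum(category_counts.values())

# Fraction of the corpus that falls into each category.
shares = {name: count / total for name, count in category_counts.items()}

# Name of the category with the most abstracts.
largest = max(category_counts, key=category_counts.get)

# Print the categories from largest to smallest with their percentage share.
for name, count in sorted(category_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:42s} {count:5d} {100 * shares[name]:6.2f}%")
```

Under these counts, the largest category alone holds roughly 90% of the abstracts and the two largest together about 99%, while four of the nine categories contain a single abstract each.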


Table 4: Other features of the hep-ex corpus
Feature                                Value
Size of the corpus (bytes)             962,802
Number of categories                   9
Number of abstracts                    2,922
Total number of terms                  135,969
Vocabulary size (terms)                6,150
Average number of terms per abstract   46.53
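
The average number of terms per abstract in Table 4 follows from two other rows of the same table, namely the total number of terms divided by the number of abstracts. A minimal sketch of that check, with illustrative variable names that are assumptions rather than anything from the original work:

```python
# Values copied from Table 4.
total_terms = 135_969
num_abstracts = 2_922

# Average number of terms per abstract.
terms_per_abstract = total_terms / num_abstracts

print(f"{terms_per_abstract:.2f}")
```

Rounded to two decimals, this agrees with the 46.53 reported in the table.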


David Pinto 2007-05-08