The distribution of the categories for each corpus is detailed in Table 3, while other characteristics are shown in Table 4. As can be seen, this corpus is highly unbalanced: a single category accounts for 2,623 of its 2,922 abstracts, which makes the task even more challenging.
Category | # of texts |
Particle physics (experimental results) | 2,623 |
Detectors and experimental techniques | 271 |
Accelerators and storage rings | 18 |
Particle physics (phenomenology) | 3 |
Astrophysics and astronomy | 3 |
Information transfer and management | 1 |
Nonlinear systems | 1 |
Other fields of physics | 1 |
XX | 1 |
Total | 2,922 |
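
To make the degree of imbalance concrete, the following minimal sketch computes each category's share of the corpus from the counts in Table 3. The function name `category_shares` and the max/min ratio used as an imbalance indicator are illustrative assumptions, not part of the original experiments.

```python
# Category counts copied from Table 3.
counts = {
    "Particle physics (experimental results)": 2623,
    "Detectors and experimental techniques": 271,
    "Accelerators and storage rings": 18,
    "Particle physics (phenomenology)": 3,
    "Astrophysics and astronomy": 3,
    "Information transfer and management": 1,
    "Nonlinear systems": 1,
    "Other fields of physics": 1,
    "XX": 1,
}


def category_shares(counts):
    """Return each category's fraction of the whole corpus (hypothetical helper)."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total = sum(counts.values())
shares = category_shares(counts)
majority = max(shares, key=shares.get)

# Ratio between the largest and smallest category (an assumed, simple indicator).
imbalance_ratio = max(counts.values()) / min(counts.values())

for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:45s} {share:7.2%}")
```

Under these counts the largest category holds roughly 90% of the abstracts, while four categories contain a single abstract each.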
Feature | Value |
Size of the corpus (bytes) | 962,802 |
Number of categories | 9 |
Number of abstracts | 2,922 |
Total number of terms | 135,969 |
Vocabulary size (terms) | 6,150 |
Term average per abstract | 46.53 |
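
The term-based features in Table 4 can be derived from tokenized abstracts along the following lines. This is a minimal sketch: the function `corpus_statistics` and the toy input are assumptions for illustration, and the tokenization behind the reported figures is not specified in the text.

```python
def corpus_statistics(abstracts):
    """Compute Table 4-style features from a list of token lists (hypothetical helper)."""
    total_terms = sum(len(tokens) for tokens in abstracts)
    vocabulary = {term for tokens in abstracts for term in tokens}
    average = total_terms / len(abstracts) if abstracts else 0.0
    return {
        "abstracts": len(abstracts),
        "total_terms": total_terms,
        "vocabulary_size": len(vocabulary),
        "avg_terms_per_abstract": average,
    }


# Toy input, not taken from the corpus.
toy = [["higgs", "boson", "decay"], ["boson", "mass"]]
stats = corpus_statistics(toy)

# Consistency check on the reported figures: total terms / number of abstracts.
reported_average = round(135969 / 2922, 2)
```

The last line reproduces the reported term average per abstract from the other two entries of the table.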
We preprocessed these collections by removing stopwords and applying the Porter stemmer. Given their average size per abstract (Table 4), the preprocessed collections are suitable for our experiments.
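
The preprocessing described above can be sketched as a two-step pipeline: stopword removal followed by stemming. The stopword list, the tokenization rule, and the `naive_stem` function below are assumptions made so that the example is self-contained. In particular, `naive_stem` is a crude stand-in and not the Porter algorithm.

```python
import re

# Tiny illustrative stopword list (assumption); the text does not say which list was used.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "with"}


def naive_stem(word):
    """Crude suffix stripper used only as a stand-in; this is NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stem=naive_stem, stopwords=STOPWORDS):
    """Lowercase, keep alphabetic tokens, drop stopwords, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in stopwords]


tokens = preprocess("The detectors are measuring decays of the particles")
print(tokens)
```

To apply an actual Porter stemmer, one option (assuming NLTK is installed) is to pass `nltk.stem.PorterStemmer().stem` as the `stem` argument; the rest of the pipeline stays unchanged.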