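The category distribution of the corpus is given in Table 3; other characteristics are shown in Table 4. As can be seen, the corpus is heavily unbalanced: the largest category alone accounts for 2,623 of the 2,922 abstracts (about 90%), while four categories contain a single abstract each. This makes the task even more challenging.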

Table 3: Distribution of categories in the corpus.

| Category | # of abstracts |
| --- | ---: |
| Particle physics (experimental results) | 2,623 |
| Detectors and experimental techniques | 271 |
| Accelerators and storage rings | 18 |
| Particle physics (phenomenology) | 3 |
| Astrophysics and astronomy | 3 |
| Information transfer and management | 1 |
| Nonlinear systems | 1 |
| Other fields of physics | 1 |
| XX | 1 |

Table 4: Other characteristics of the corpus.

| Feature | Value |
| --- | ---: |
| Size of the corpus (bytes) | 962,802 |
| Number of categories | 9 |
| Number of abstracts | 2,922 |
| Total number of terms | 135,969 |
| Vocabulary size (terms) | 6,150 |
| Average number of terms per abstract | 46.53 |
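
As a quick sanity check on the figures above, the following minimal sketch recomputes the derived quantities from the raw counts: the total number of abstracts, the share of each category, and the average number of terms per abstract. The numbers are copied from Tables 3 and 4; the variable names and the imbalance ratio (largest over smallest category) are illustrative choices, not part of the original work.

```python
# Abstracts per category, copied from Table 3.
category_counts = {
    "Particle physics (experimental results)": 2623,
    "Detectors and experimental techniques": 271,
    "Accelerators and storage rings": 18,
    "Particle physics (phenomenology)": 3,
    "Astrophysics and astronomy": 3,
    "Information transfer and management": 1,
    "Nonlinear systems": 1,
    "Other fields of physics": 1,
    "XX": 1,
}

# Total number of terms, copied from Table 4.
total_terms = 135_969

# Derived quantities (names are hypothetical, chosen for this sketch).
total_abstracts = sum(category_counts.values())
shares = {name: n / total_abstracts for name, n in category_counts.items()}
majority_share = max(shares.values())
imbalance_ratio = max(category_counts.values()) / min(category_counts.values())
terms_per_abstract = total_terms / total_abstracts

# Print the per-category share as a percentage.
for name, share in shares.items():
    print(f"{name}: {share:.2%}")

print(f"Abstracts: {total_abstracts}")
print(f"Average terms per abstract: {terms_per_abstract:.2f}")
```

The recomputed total and average agree with the values reported in Table 4, and the majority share makes the imbalance concrete: a classifier that always predicts the largest category would already be right for roughly nine abstracts out of ten.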