Feature | Subset of hep-ex | Full collection hep-ex |
Size of the corpus (bytes) | 165,349 | 962,802 |
Number of categories | 7 | 9 |
Number of abstracts | 500 | 2,922 |
Total number of terms | 23,500 | 135,969 |
Vocabulary size (terms) | 2,430 | 6,150 |
Average number of terms per abstract | 47 | 46.53 |
Category | Number of texts | Subset of hep-ex | Full collection |
Information Transfer and Management | 1 | NO | YES |
Particle Physics - Phenomenology | 3 | YES | YES |
Particle Physics - Experimental Results | 2623 | YES | YES |
XX | 1 | YES | YES |
Nonlinear Systems | 1 | YES | YES |
Accelerators and Storage Rings | 18 | YES | YES |
Astrophysics and Astronomy | 3 | YES | YES |
Other Fields of Physics | 1 | NO | YES |
Detectors and Experimental Techniques | 271 | YES | YES |
We preprocessed both collections by removing stopwords and applying the Porter stemmer. Given their average size of approx. 47 words per abstract, the preprocessed collections are suitable for our experiments. The preprocessed corpora, the stopword list, and the stemmer can be downloaded from the project site.
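The preprocessing and the corpus features reported above can be illustrated with a minimal sketch. This is not the authors' code. The stopword list, the tokenization rule, and the names `preprocess` and `corpus_stats` are assumptions made for illustration, and the stemmer is passed in as a plain callable that defaults to the identity function instead of the Porter stemmer. We also assume here that the term counts are taken after preprocessing.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list for illustration only; the list actually
# used for the corpora is the one distributed on the project site.
STOPWORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to", "we"})


def preprocess(text, stopwords=STOPWORDS, stem=lambda w: w):
    """Lowercase, split into alphabetic tokens, drop stopwords, stem the rest.

    `stem` is any callable str -> str. The default is the identity function,
    which is NOT the Porter stemmer; it only keeps the sketch dependency-free.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in stopwords]


def corpus_stats(abstracts, **kwargs):
    """Compute the corpus features of the first table on preprocessed text."""
    docs = [preprocess(a, **kwargs) for a in abstracts]
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {
        "abstracts": len(docs),
        "total_terms": total,
        "vocabulary_size": len(counts),
        "avg_terms_per_abstract": total / len(docs) if docs else 0.0,
    }


# Two made-up abstracts, not taken from the hep-ex collection.
sample = [
    "We present a measurement of the top quark mass.",
    "The detector is calibrated and the mass resolution is measured.",
]
stats = corpus_stats(sample)
print(stats)
```

To apply an actual Porter stemmer, one option is to pass the `stem` method of NLTK's `PorterStemmer` (for example `corpus_stats(sample, stem=PorterStemmer().stem)`), assuming NLTK is installed. The stemmer distributed on the project site may differ in detail from that implementation.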