
Data Set

In our experiments we used two corpora based on the hep-ex collection of abstracts compiled and provided to us by the University of Jaén, Spain [11]. The first corpus was built by extracting a subset of documents from the full collection. The second corpus is the full collection itself, which consists of 2,922 abstracts from the Physics domain originally stored at CERN[*]. The features of both corpora are shown in Table 1, and the distribution of categories in each corpus is given in Table 2.


Table 1: Features of the preprocessed collections

    Feature                        Subset of hep-ex   Full hep-ex collection
    Size of the corpus (bytes)              165,349                  962,802
    Number of categories                          7                        9
    Number of abstracts                         500                    2,922
    Total number of terms                    23,500                  135,969
    Vocabulary size (terms)                   2,430                    6,150
    Average terms per abstract                   47                    46.53
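
The corpus-level features in Table 1 follow directly from the token streams of the preprocessed abstracts. The following is a minimal sketch of how they can be derived, assuming each abstract has already been reduced to a list of terms; the function name corpus_features is hypothetical, not part of the original work.

    def corpus_features(corpus: list[list[str]]) -> dict:
        """Compute Table 1 style statistics for a preprocessed corpus,
        given as a list of abstracts, each a list of term strings."""
        terms = [t for abstract in corpus for t in abstract]
        return {
            "number of abstracts": len(corpus),
            "total number of terms": len(terms),
            "vocabulary size (terms)": len(set(terms)),
            "average terms per abstract": len(terms) / len(corpus),
        }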


Table 2: Categories in the corpora

    Category                                  Number of texts   Subset of hep-ex   Full collection
    Information Transfer and Management                     1                 NO               YES
    Particle Physics - Phenomenology                        3                YES               YES
    Particle Physics - Experimental Results             2,623                YES               YES
    XX                                                      1                YES               YES
    Nonlinear Systems                                       1                YES               YES
    Accelerators and Storage Rings                         18                YES               YES
    Astrophysics and Astronomy                              3                YES               YES
    Other Fields of Physics                                 1                 NO               YES
    Detectors and Experimental Techniques                 271                YES               YES

We preprocessed both collections by removing stopwords and applying the Porter stemmer. Given the short average abstract length (approximately 47 terms), the preprocessed collections are well suited to our experiments. The preprocessed corpora, the stopword list, and the stemmer can be downloaded from the project site[*].
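
The preprocessing step just described can be reproduced along the following lines. This is a minimal sketch in Python that uses NLTK's English stopword list and Porter stemmer implementation as stand-ins for the project's own downloadable resources; the helper name preprocess is hypothetical.

    # Minimal sketch of the preprocessing step: lowercase, tokenize,
    # remove stopwords, and apply the Porter stemmer.
    # NLTK's resources stand in for the project's own stopword list and
    # stemmer; run nltk.download('stopwords') once beforehand.
    import re

    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer

    STOPWORDS = set(stopwords.words("english"))
    STEMMER = PorterStemmer()

    def preprocess(abstract: str) -> list[str]:
        """Return the list of stemmed content terms of one abstract."""
        tokens = re.findall(r"[a-z]+", abstract.lower())
        return [STEMMER.stem(t) for t in tokens if t not in STOPWORDS]

    print(preprocess("Measurements of the top quark mass at hadron colliders"))
    # -> ['measur', 'top', 'quark', 'mass', 'hadron', 'collid']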

