

The CICLing-2002 corpus

This corpus consists of 48 abstracts from the Computational Linguistics domain, taken from the CICLing 2002 conference. Makagonov et al. [15] used this collection in their experiments on clustering short texts of narrow domains. We consider it a very small but necessary reference corpus, which also serves for manually investigating the obtained results.

The topics of this corpus are the following: Linguistics (semantics, syntax, morphology, and parsing), Ambiguity (WSD, anaphora, POS, and spelling), Lexicon (lexics, corpus, and text generation), and Text Processing (information retrieval, summarization, and classification of texts). The distribution and the features of this corpus are shown in Tables 1 and 2, respectively.


Table 1: Distribution of the CICLing-2002 corpus
Category           Number of abstracts
Linguistics        11
Ambiguity          15
Lexicon            11
Text Processing    11


Table 2: Other features of the CICLing-2002 corpus
Feature                          Value
Size of the corpus (bytes)       23,971
Number of categories             4
Number of abstracts              48
Total number of terms            3,382
Vocabulary size (terms)          953
Average terms per abstract       70.45
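
As an illustration of how features like those in Table 2 can be derived, the following is a minimal sketch in Python. It is not the procedure used to build the table: the function name `corpus_features`, the toy abstracts, and above all the tokenization (lowercased, whitespace-separated tokens) are assumptions made here, since the text does not state how terms were extracted.

```python
from collections import Counter


def corpus_features(abstracts):
    """Compute simple size statistics for a list of abstract strings.

    Assumption: a "term" is a lowercased, whitespace-separated token.
    The original corpus may have been tokenized differently.
    """
    tokens_per_abstract = [text.lower().split() for text in abstracts]
    # Count every token occurrence across the whole collection.
    counts = Counter(tok for toks in tokens_per_abstract for tok in toks)
    total_terms = sum(counts.values())
    n_abstracts = len(abstracts)
    return {
        "abstracts": n_abstracts,
        "total_terms": total_terms,
        "vocabulary_size": len(counts),  # number of distinct terms
        "avg_terms_per_abstract": total_terms / n_abstracts if n_abstracts else 0.0,
    }


# Hypothetical toy input, not taken from the CICLing-2002 corpus.
toy = [
    "word sense disambiguation of nouns",
    "parsing of free word order languages",
]
features = corpus_features(toy)
print(features)
```

Applied to the figures reported in Table 2, the average is the total number of terms divided by the number of abstracts: 3,382 / 48 is approximately 70.46, in line with the reported 70.45 (the small difference presumably comes from rounding or truncation).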


David Pinto 2007-05-08