The CICLing-2002 corpus

This corpus consists of 48 abstracts from the Computational Linguistics domain, taken from the CICLing 2002 conference. This collection was used by Makagonov et al. [15] in their experiments on clustering short texts of narrow domains. We consider it a very small but necessary reference corpus, whose size also allows the obtained results to be inspected manually.

The corpus covers the following topics: Linguistics (semantics, syntax, morphology, and parsing), Ambiguity (WSD, anaphora, POS, and spelling), Lexicon (lexics, corpus, and text generation), and Text Processing (information retrieval, summarization, and classification of texts). The distribution of abstracts over these categories and the other features of the corpus are shown in Tables 1 and 2, respectively.

**Table 1:** Distribution of the *CICLing-2002* corpus

| Category | Number of abstracts |
|---|---|
| Linguistics | 11 |
| Ambiguity | 15 |
| Lexicon | 11 |
| Text Processing | 11 |

**Table 2:** Other features of the *CICLing-2002* corpus

| Feature | Value |
|---|---|
| Size of the corpus (bytes) | 23,971 |
| Number of categories | 4 |
| Number of abstracts | 48 |
| Total number of terms | 3,382 |
| Vocabulary size (terms) | 953 |
| Average number of terms per abstract | 70.45 |
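
The features in Table 2 are simple counts, and a minimal sketch of how such figures might be computed is given below. It is an illustration only: the function name `corpus_features`, the toy input, and the lowercasing plus whitespace tokenization are assumptions, since the text does not state which preprocessing was applied to the corpus. The last lines check the reported figures against each other: the category counts of Table 1 sum to 48, and 3,382 / 48 ≈ 70.458, which agrees with the reported 70.45 up to rounding.

```python
def corpus_features(abstracts):
    """Return basic size statistics for a list of abstract strings.

    Assumption: terms are obtained by lowercasing and splitting on
    whitespace. The preprocessing actually used for the corpus is not
    described in the text, so real counts may differ.
    """
    tokenized = [text.lower().split() for text in abstracts]
    total_terms = sum(len(tokens) for tokens in tokenized)
    vocabulary = {token for tokens in tokenized for token in tokens}
    return {
        "num_abstracts": len(abstracts),
        "total_terms": total_terms,
        "vocabulary_size": len(vocabulary),
        "avg_terms_per_abstract": (
            total_terms / len(abstracts) if abstracts else 0.0
        ),
    }


# Hypothetical toy input, not taken from the corpus.
toy = [
    "word sense disambiguation with corpus",
    "parsing and word morphology",
]
features = corpus_features(toy)
print(features)

# Consistency check on the values reported in Tables 1 and 2.
reported_distribution = {
    "Linguistics": 11,
    "Ambiguity": 15,
    "Lexicon": 11,
    "Text Processing": 11,
}
reported_total_terms = 3382
reported_abstracts = 48
reported_average = reported_total_terms / reported_abstracts
print(sum(reported_distribution.values()), round(reported_average, 3))
```

Applied to the real abstracts, the same function would only reproduce the values of Table 2 if its tokenization matched the one used when the corpus was built.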

David Pinto 2007-05-08