
The CICLing corpus

This corpus is balanced and consists of 48 abstracts from the "Computational Linguistics and Text Processing" domain, extracted from the CICLing 2002 conference. Its topics are the following: Linguistics (semantics, syntax, morphology, and parsing), Ambiguity (WSD, anaphora, POS, and spelling), Lexical (lexics, corpus, and text generation), and Text processing (information retrieval, summarization, and classification of texts). The distribution of abstracts across categories and the other features of this corpus are shown in Tables 1 and 2, respectively.


Table 1: Distribution of the CICLing corpus
Category           # of abstracts
Linguistics        11
Ambiguity          15
Lexical            11
Text processing    11
Total              48


Table 2: Other features of the CICLing corpus
Feature                       Value
Size of the corpus (bytes)    23,971
Number of categories          4
Number of abstracts           48
Total number of terms         3,382
Vocabulary size (terms)       953
Term average per abstract     70.45
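
The figures in the two tables can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the original experiments: it assumes the total number of terms is the integer 3382 (the table uses a thousands separator), and the variable names are hypothetical. It recomputes the number of abstracts, the share of each category, and the average number of terms per abstract.

```python
# Counts copied from the distribution table (abstracts per category).
distribution = {
    "Linguistics": 11,
    "Ambiguity": 15,
    "Lexical": 11,
    "Text processing": 11,
}

# "Total number of terms" from the features table, read as an integer (assumption).
total_terms = 3382

# The total number of abstracts is the sum over the categories.
total_abstracts = sum(distribution.values())

# Fraction of the corpus that falls in each category.
shares = {category: count / total_abstracts for category, count in distribution.items()}

# Average number of terms per abstract.
terms_per_abstract = total_terms / total_abstracts

for category, share in shares.items():
    print(f"{category}: {share:.1%}")
print(f"Abstracts: {total_abstracts}")
print(f"Terms per abstract: {terms_per_abstract:.3f}")
```

The quotient 3382 / 48 is approximately 70.458, which agrees with the reported 70.45 if that value was truncated rather than rounded. Three categories hold 11 abstracts each and Ambiguity holds 15, so the corpus is close to, but not exactly, uniformly distributed.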


David Pinto 2006-05-25