

The KnCr corpus of MEDLINE

This corpus, named KnCr, was created for the specific task of clustering short texts from a narrow medical domain [21]. It consists of 900 MEDLINE abstracts related to the "Cancer" domain, distributed over 16 categories. Tables 5 and 6 show the complete characteristics of this new corpus.


Table 5: Categories of the KnCr corpus
Category # of abstracts   Category # of abstracts
blood 64   lung 99
bone 8   lymphoma 30
brain 14   renal 6
breast 119   skin 31
colon 51   stomach 12
genetic studies 66   therapy 169
genitals 160   thyroid 20
liver 29   Other (XXX) 22


Table 6: Other features of the KnCr corpus
Feature Value
Size of the corpus (bytes) 834,212
Number of categories 16
Number of abstracts 900
Total number of terms 113,822
Vocabulary size (terms) 11,958
Average number of terms per abstract 126.47
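
The features in Table 6 are related by simple arithmetic: the average number of terms per abstract is the total number of terms divided by the number of abstracts (113,822 / 900 ≈ 126.47). The following minimal sketch shows how such figures could be computed for a list of abstracts. The function name `corpus_features` and the lowercase, whitespace-based tokenization are assumptions made for illustration; the source does not say how terms were counted for KnCr, and the sample texts are toy data, not taken from the corpus.

```python
from collections import Counter


def corpus_features(abstracts):
    """Compute Table 6-style features for a list of abstract strings.

    Assumption: a "term" is a lowercase, whitespace-separated token.
    The tokenization actually used for KnCr is not stated in the source.
    """
    term_counts = Counter()
    for text in abstracts:
        term_counts.update(text.lower().split())

    total_terms = sum(term_counts.values())
    n_abstracts = len(abstracts)
    average = round(total_terms / n_abstracts, 2) if n_abstracts else 0.0

    return {
        "abstracts": n_abstracts,
        "total_terms": total_terms,
        "vocabulary_size": len(term_counts),  # number of distinct terms
        "avg_terms_per_abstract": average,
    }


# Toy input for illustration only (not KnCr data).
sample = ["lung cancer therapy", "breast cancer screening study"]
features = corpus_features(sample)
print(features)
```

Applied to the full corpus under the same tokenization assumption, the returned dictionary would correspond to the "Number of abstracts", "Total number of terms", "Vocabulary size" and average rows of Table 6.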


David Pinto 2007-05-08