Next: Conclusions Up: Clustering Narrow-Domain Short Texts Previous: Performance measurement

Results

In our experiments, the DF and TS techniques did not improve on the results obtained with the transition point (TP) technique, which reinforces the hypothesis suggested in [19]. We also observed no significant difference among the symmetric KL distances; we therefore consider that, in other applications, the simplest one should be used. Tables 7, 8, and 9 show the evaluation results for all the Kullback-Leibler approaches implemented, using the CICLing-2002, hep-ex, and KnCr corpora, respectively. Each table has three sections, (a), (b), and (c), corresponding to the TP, DF, and TS feature selection techniques, respectively. The row labels KullbackOriginal, KullbackBigi, KullbackJensen, and KullbackMax denote the KLD variants defined by Kullback and Leibler [13], Bigi [4], Jensen [10], and Bennet [2] [27], respectively.
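To make the four row labels concrete, the following is a minimal sketch of four symmetric KL variants. The formulas are the commonly cited forms and are an assumption here, since this section does not restate them; the function names (`kl`, `kl_original`, `kl_bigi`, `kl_jensen`, `kl_max`) are hypothetical, and the inputs are assumed to be aligned, strictly positive probability vectors (that is, already smoothed).

```python
import math


def kl(p, q):
    """Directed divergence D(P||Q) for aligned, strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def kl_original(p, q):
    # Assumed form: sum of the two directed divergences, D(P||Q) + D(Q||P).
    return kl(p, q) + kl(q, p)


def kl_bigi(p, q):
    # Assumed form: sum over terms of (p - q) * log(p / q).
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


def kl_jensen(p, q):
    # Assumed form: average divergence of P and Q from their midpoint M.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * (kl(p, m) + kl(q, m))


def kl_max(p, q):
    # Assumed form: the larger of the two directed divergences.
    return max(kl(p, q), kl(q, p))


if __name__ == "__main__":
    # Two toy term distributions over the same three-term vocabulary.
    p = [0.5, 0.3, 0.2]
    q = [0.1, 0.6, 0.3]
    for name, dist in [("KullbackOriginal", kl_original), ("KullbackBigi", kl_bigi),
                       ("KullbackJensen", kl_jensen), ("KullbackMax", kl_max)]:
        print(name, dist(p, q), dist(q, p))
```

Under these assumed definitions every variant returns the same value when its arguments are swapped, and the first two forms coincide algebraically. The definitions actually used in the experiments may differ in detail, for instance in how the distributions are smoothed.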

Table 7: Results obtained by using the CICLing-2002 corpus

(a)-TP
	SLC	CLC	KStar
KullbackOriginal	0.6	0.7	0.7
KullbackBigi	0.6	0.7	0.7
KullbackJensen	0.6	0.6	0.7
KullbackMax	0.6	0.7	0.7

(b)-DF
	SLC	CLC	KStar
KullbackOriginal	0.6	0.6	0.6
KullbackBigi	0.6	0.7	0.6
KullbackJensen	0.6	0.6	0.6
KullbackMax	0.6	0.7	0.6

(c)-TS
	SLC	CLC	KStar
KullbackOriginal	0.5	0.6	0.6
KullbackBigi	0.5	0.5	0.6
KullbackJensen	0.5	0.6	0.6
KullbackMax	0.5	0.6	0.6

Table 8: Results obtained by using the hep-ex corpus

(a)-TP
	SLC	CLC	KStar
KullbackOriginal	0.86	0.83	0.68
KullbackBigi	0.86	0.82	0.69
KullbackJensen	0.85	0.83	0.68
KullbackMax	0.86	0.83	0.69

(b)-DF
	SLC	CLC	KStar
KullbackOriginal	0.60	0.83	0.68
KullbackBigi	0.60	0.82	0.67
KullbackJensen	0.61	0.83	0.69
KullbackMax	0.61	0.83	0.68

(c)-TS
	SLC	CLC	KStar
KullbackOriginal	0.80	0.84	0.67
KullbackBigi	0.80	0.85	0.67
KullbackJensen	0.80	0.83	0.66
KullbackMax	0.80	0.85	0.67

Table 9: Results obtained by using the KnCr corpus

(a)-TP
	SLC	CLC	KStar
KullbackOriginal	0.52	0.38	0.39
KullbackBigi	0.52	0.38	0.39
KullbackJensen	0.52	0.36	0.40
KullbackMax	0.51	0.37	0.40

(b)-DF
	SLC	CLC	KStar
KullbackOriginal	0.51	0.37	0.38
KullbackBigi	0.51	0.37	0.38
KullbackJensen	0.52	0.36	0.39
KullbackMax	0.51	0.37	0.39

(c)-TS
	SLC	CLC	KStar
KullbackOriginal	0.49	0.36	0.38
KullbackBigi	0.49	0.36	0.38
KullbackJensen	0.48	0.34	0.38
KullbackMax	0.50	0.37	0.38

We compared our results with those reported by Pinto et al. [20]. The comparison is presented in Tables 10 and 11, where our best approach is set against the results of [20], labelled PintoetAl. It could be carried out only on the CICLing-2002 and hep-ex corpora because, to date, no published results with the required characteristics exist for the KnCr corpus. We observed that KLD obtains comparable results, and we consider that this behaviour derives from the size of each text. We suggest using a smoothing procedure, but the number of document terms that do not appear in the corpus vocabulary can be extremely high. Further analysis will investigate this issue.

Table 10: Comparison by using the CICLing-2002 corpus

(a)-TP
	SLC	CLC	KStar
KullbackMax	0.6	0.7	0.7
PintoetAl	0.6	0.7	0.7

(b)-DF
	SLC	CLC	KStar
KullbackMax	0.6	0.7	0.6
PintoetAl	0.6	0.7	0.6

(c)-TS
	SLC	CLC	KStar
KullbackMax	0.5	0.6	0.6
PintoetAl	0.5	0.7	0.6

Table 11: Comparison by using the hep-ex corpus

(a)-TP
	SLC	CLC	KStar
KullbackMax	0.86	0.83	0.69
PintoetAl	0.77	0.87	0.69

(b)-DF
	SLC	CLC	KStar
KullbackMax	0.61	0.83	0.68
PintoetAl	0.59	0.86	0.68

(c)-TS
	SLC	CLC	KStar
KullbackMax	0.80	0.85	0.67
PintoetAl	0.74	0.86	0.67
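The smoothing issue raised before Tables 10 and 11 can be illustrated with a small sketch. It assumes a simple back-off scheme in which every vocabulary term missing from a document receives a small probability `epsilon`, and the observed relative frequencies are scaled down so that the distribution still sums to one. The function name `smoothed_distribution`, the default value of `epsilon`, and the choice to discard tokens outside the vocabulary are all illustrative assumptions, not the procedure used in the experiments.

```python
from collections import Counter


def smoothed_distribution(tokens, vocabulary, epsilon=1e-6):
    """Return a strictly positive term distribution over `vocabulary`.

    Assumption: terms absent from the document get probability `epsilon`;
    observed terms share the remaining mass in proportion to their counts.
    """
    vocab = list(vocabulary)
    # Tokens outside the vocabulary are ignored (an assumption of this sketch).
    counts = Counter(t for t in tokens if t in set(vocab))
    total = sum(counts.values())
    if total == 0:
        # No usable term in the document: fall back to a uniform distribution.
        return {t: 1.0 / len(vocab) for t in vocab}

    missing = len(vocab) - len(counts)
    reserved = epsilon * missing  # probability mass handed to missing terms
    if reserved >= 1.0:
        raise ValueError("epsilon is too large for this vocabulary size")

    beta = 1.0 - reserved  # scaling factor for the observed frequencies
    return {t: beta * counts[t] / total if t in counts else epsilon for t in vocab}


if __name__ == "__main__":
    vocabulary = ["cluster", "corpus", "distance", "term"]
    document = ["cluster", "cluster", "corpus", "unseen"]
    print(smoothed_distribution(document, vocabulary, epsilon=0.01))
```

With short texts most of the vocabulary is missing from any single document, so the reserved mass `epsilon * missing` grows with the vocabulary size. This is one way to see why the choice of smoothing can matter so much in this setting.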


David Pinto 2007-05-08