Next: Discussion Up: Experiments Previous: Data Description


Figure 1 shows the interpolated average precision at the standard recall levels [1]. Two of these curves were presented previously, the classical VSM and TP [13]; we therefore use them as a reference for our own results. The three remaining curves were obtained with the representation schemes presented in Section 2: H, terms selected by using entropy; TP', TP terms enriched with bigrams; and H+TP, the union of H and TP.

Figure 1: Performance of term selection using entropy ($H$) and transition point ($TP$).
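As an illustration of how a curve like those in Figure 1 can be produced, the following is a minimal sketch of interpolated precision at the 11 standard recall levels for a single query. It follows the usual textbook definition (the interpolated precision at recall level r is the highest precision observed at any recall greater than or equal to r), which we assume is the one intended by [1]. The function name and the toy document identifiers are hypothetical; the curves in the figure would additionally average these values over all queries.

```python
def interpolated_precision(ranking, relevant, levels=None):
    """Interpolated precision at standard recall levels for one query.

    ranking  -- list of document ids, best match first (hypothetical input)
    relevant -- set of ids judged relevant for the query (must be non-empty)
    levels   -- recall levels; defaults to 0.0, 0.1, ..., 1.0
    """
    if levels is None:
        levels = [i / 10 for i in range(11)]

    # Record (recall, precision) each time a relevant document is retrieved.
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    # Interpolation rule: take the maximum precision at any recall >= level.
    # If that recall level is never reached, the precision is taken as 0.
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]


# Toy example: two relevant documents, found at ranks 1 and 3.
curve = interpolated_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

In this toy example the precision is 1.0 up to the 0.5 recall level and 2/3 from 0.6 onwards, which is the kind of step-down that the averaged curves smooth out.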

The TP-based method outperforms the classical VSM while using few computational resources. The entropy-based method, on the other hand, performs very well but at a higher computational cost. The TP approach enriched with bigrams (TP') performs similarly to the entropy-based method. Finally, the curve for the union of entropy and TP suggests that the weighting procedure (which uses both Equation (1) and Equation (7)) is not assigning adequate importance to terms, since precision drops beyond the 0.6 recall level.

The vocabulary size for each method is shown in Table 1. Entropy achieved the highest reduction (it uses just 3.3% of the original term space). Apart from VSM, the enriched TP (TP') has the largest vocabulary, but its results are competitive with those of the entropy method, and it requires much less computation than entropy does.

Table 1: Term reduction methods and the vocabulary size obtained for TREC-5.
Method    Vocabulary size    Reduction (%)
VSM       235,808             0.00
TP         28,111            88.08
H           7,870            96.70
TP'        36,442            84.55
H+TP       29,117            87.66
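The reduction column in Table 1 can be related to the vocabulary sizes with a short calculation. The sketch below assumes that the percentage of reduction is measured against the full VSM vocabulary, i.e. 100 * (1 - size / VSM size); the text does not state the formula explicitly, so this is an assumption, and recomputed values may differ from the printed ones in the last digits because of rounding. The function and variable names are hypothetical.

```python
# Vocabulary sizes as listed in Table 1 (TREC-5).
vocabulary_size = {
    "VSM": 235808,
    "TP": 28111,
    "H": 7870,
    "TP'": 36442,
    "H+TP": 29117,
}


def reduction_percentage(size, baseline):
    """Percentage of the baseline vocabulary that was removed (assumed formula)."""
    return 100.0 * (1.0 - size / baseline)


baseline = vocabulary_size["VSM"]
reduction = {
    method: reduction_percentage(size, baseline)
    for method, size in vocabulary_size.items()
}

for method, value in reduction.items():
    kept = 100.0 - value  # share of the original term space still in use
    print(f"{method:5s} reduction = {value:6.2f}%  kept = {kept:6.2f}%")
```

Under this assumption the share kept by H comes out at roughly 3.3% of the original term space, in line with the figure quoted in the text.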

David Pinto 2007-05-08