Improving Transition Point approach:


A refined method based on the Transition Point (TP) technique was proposed to improve the results obtained over the whole hep-ex collection. This method, named Transition Point and Mutual Information (TPMI), weights the TP-based score of each term using mutual information (MI). The MI value may be used as a refinement of the selection method provided by TP.

Let $tp_T$ be the transition point of the text $T=[t_1,\ldots,t_k]$. We can calculate the MI score of each term $t_i$ as $MI(tp_T,t_i)$. TPMI then assigns the final score:

$\displaystyle tpmi(t_i,T)=idtp(t_i,T)*MI(tp_T,t_i)$

(8)

$MI(tp_T,t_i)$ was computed considering $n$-grams in which $t_i$ appears at a distance of 2 words from $tp_T$, and the frequency of both $tp_T$ and $t_i$ was greater than 2.
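The scoring step can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the text: the transition point is represented by a single term `tp_term`, "a distance of 2 words" is read as exactly two positions apart, MI is taken as pointwise mutual information with probabilities estimated over token positions, and `idtp` is supplied by the caller because its definition is given elsewhere. The function names `mi_scores` and `tpmi` are hypothetical.

```python
import math
from collections import Counter


def mi_scores(tokens, tp_term, distance=2, min_freq=2):
    """Pointwise MI between tp_term and each term found `distance` positions away.

    Assumption: probabilities are relative frequencies over token positions.
    """
    n = len(tokens)
    freq = Counter(tokens)
    scores = {}
    # The frequency filter applies to the transition-point term as well.
    if freq[tp_term] <= min_freq:
        return scores

    # Count co-occurrences exactly `distance` positions to the left or right.
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok != tp_term:
            continue
        for j in (i - distance, i + distance):
            if 0 <= j < n and tokens[j] != tp_term:
                co[tokens[j]] += 1

    p_tp = freq[tp_term] / n
    for term, count in co.items():
        if freq[term] > min_freq:  # keep only terms with frequency > min_freq
            p_joint = count / n
            p_term = freq[term] / n
            scores[term] = math.log2(p_joint / (p_tp * p_term))
    return scores


def tpmi(tokens, tp_term, idtp, distance=2, min_freq=2):
    """Combine a caller-supplied idtp(term) weight with the MI score, as in Eq. (8)."""
    mi = mi_scores(tokens, tp_term, distance, min_freq)
    return {term: idtp(term) * score for term, score in mi.items()}
```

Terms that never co-occur with `tp_term` at the given distance, or that fail the frequency filter, receive no score in this sketch; how such terms are treated in the original experiments is not specified here.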

The results obtained with this refined method are shown in Figure 2, where this approach obtains the best value of the $F$-measure. For the DF and TS methods, clustering results on the whole collection were very similar to those obtained on the hep-ex subset. The TS method reached the maximum value (0.5925) with 43% of the terms, which corresponds to a collection vocabulary size of 2,644 terms; only 3,318 terms satisfy the threshold $\beta$. Although the DF method is very stable, its values remain below the baseline (0.5919). The TPMI method showed a high peak when taking 20 terms, giving a vocabulary size of 42,167 terms.


David Pinto 2006-05-25