
Improving the Transition Point approach:

A refined method based on the Transition Point (TP) technique was proposed to improve the results obtained over the whole hep-ex collection. This method, named Transition Point and Mutual Information (TPMI), weights $ idtp(t,T)$ by a mutual information (MI) score, which serves as a refinement of the selection method provided by TP.

Let $ tp_T$ be the transition point of the text $ T=[t_1,\ldots,t_k]$. The MI score of each term $ t_i$ is calculated as $ MI(tp_T,t_i)$. TPMI then assigns the final score:

$\displaystyle tpmi(t_i,T)=idtp(t_i,T)*MI(tp_T,t_i)$ (8)

$ MI(x,y)$ was computed considering $ n$-grams of $ x$, where $ y$ appears at a distance of 2 words from $ x$, and the frequency of both $ x$ and $ y$ was greater than 2.

The results obtained with this refined method are shown in Figure 2, where this approach obtains the best value of the $ F$ measure. For the DF and TS methods, the clustering results on the whole collection were very similar to those obtained on the subset of hep-ex. The TS method reached its maximum $ F$ value (0.5925) with 43% of terms, which corresponds to a collection vocabulary size of 2,644 terms, and only 3,318 terms satisfy the threshold $ \beta$. Although the DF method is very stable, it maintains its $ F$ values below the baseline (0.5919). The TPMI method had a high peak ($ F=0.6206$) taking 20 terms, and giving a vocabulary size of 42,167 terms.


David Pinto 2006-05-25