
Transition Point

Given a document $ D_i$ and its vocabulary $ V_i=\{(w_j, tf_i(w_j))\vert w_j \in D_i\}$, where $ tf_i(w_j)=tf_{ij}$, let $ TP_i$ be the transition point of $ D_i$. A set of important terms which will represent the document $ D_i$ may be calculated as follows:

$\displaystyle R_i=\{w_j\vert((w_j,tf_{ij})\in V_i),(TP_i\cdot(1-u)\le tf_{ij}\le TP_i\cdot(1+u))\},$ (6)

where $ u$ is a value in $ [0,1]$. Some experiments presented in [13] have shown that $ u=0.4$ is a good value for this threshold.
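As an illustration of Equation (6), the following is a minimal sketch rather than the implementation used in the experiments. It assumes the document is already tokenized and that the transition point is computed elsewhere and passed in; the names `select_terms`, `tokens` and `tp` are hypothetical.

```python
from collections import Counter


def select_terms(tokens, tp, u=0.4):
    """Return {term: frequency} for the terms in R_i of Equation (6).

    tokens: list of terms of one document (assumed pre-tokenized).
    tp:     transition point of the document (assumed to be given).
    u:      neighbourhood width, a value in [0, 1].
    """
    if not 0.0 <= u <= 1.0:
        raise ValueError("u must lie in [0, 1]")
    tf = Counter(tokens)                    # vocabulary V_i with term frequencies
    low, high = tp * (1 - u), tp * (1 + u)  # frequency window around TP_i
    return {w: f for w, f in tf.items() if low <= f <= high}


# Toy document: frequencies a=10, b=6, c=5, d=4, e=1, with an assumed TP of 5.
tokens = ["a"] * 10 + ["b"] * 6 + ["c"] * 5 + ["d"] * 4 + ["e"]
selected = select_terms(tokens, tp=5, u=0.4)
```

With these toy values the window is roughly $ [3, 7]$, so only the terms with frequencies 6, 5 and 4 are kept.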

For the representation schema, we consider the important terms to be those whose frequencies are closest to the TP. Therefore, a term whose frequency is very ``close'' to the TP will get a high weight, and those ``far'' from the TP will get a weight close to zero. For each term $ w_j \in R_i$, its weight, given by Equation (1), is modified according to the distance between its frequency and the transition point, obtaining a new value for its ``term frequency'' (see Equation (7)).

$\displaystyle tf_{ij}'=\Vert R_i\Vert - \vert TP_i-tf_{ij}\vert$ (7)
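Equation (7) can be sketched in a few lines. This is only an illustration under two assumptions: $ \Vert R_i\Vert$ is read as the number of terms in $ R_i$, and the selected terms with their frequencies are already available. The names `reweight`, `selected_tf` and `tp` are hypothetical.

```python
def reweight(selected_tf, tp):
    """Apply Equation (7) to every term of R_i.

    selected_tf: {term: frequency} for the terms in R_i.
    tp:          transition point of the document (assumed to be given).
    """
    size = len(selected_tf)  # assumption: ||R_i|| is the cardinality of R_i
    # The closer a frequency is to the transition point, the larger the result.
    return {w: size - abs(tp - f) for w, f in selected_tf.items()}


# Toy example: three selected terms and an assumed transition point of 5.
weights = reweight({"b": 6, "c": 5, "d": 4}, tp=5)
```

In this toy case the term whose frequency equals the transition point receives the maximum value, 3, and the two terms at distance 1 receive 2.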

David Pinto 2007-05-08