
Entropy

The focus of our work is the determination of a set of words that characterizes a given set of documents. Given a set of documents $ D=\{D_1, D_2, ..., D_M\}$, where $ N_{i}$ denotes the number of words in document $ D_i$, the relative frequency of the word $ w_j$ in $ D_i$ is defined as follows:

$\displaystyle f_{ij}=\frac{tf_{ij}}{N_{i}},$ (2)

where $ tf_{ij}$ is the number of occurrences of $ w_j$ in $ D_i$,

and

$\displaystyle p_{ij}=\frac{f_{ij}}{\sum_{j=1}^m f_{ij}}$ (3)

is the probability of the word $ w_j$ being in $ D_i$. Thus, the entropy of $ w_j$ can be calculated as:

$\displaystyle H(w_j)=-\sum_{i=1}^M p_{ij}\log p_{ij}.$ (4)

A document $ D_i$ is represented in the VSM by its high-entropy terms. Let $ H_{max}$ be the maximum value of entropy over all the terms, $ H_{max}=\max_j H(w_j)$; the entropy-based representation of $ D_i$ is

$\displaystyle H_i=[w_j\in D_i\vert H(w_j)>H_{max}\cdot u],$ (5)

where $ u$ is a threshold that defines what counts as high entropy. In our experiments we set $ u=0.5$.
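The whole procedure (Eqs. 2-5) can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: documents are assumed to be plain token lists, and the function name and its default threshold are our own choices (the default matches the $ u=0.5$ used in the experiments).

```python
import math

def entropy_term_selection(docs, u=0.5):
    """Select high-entropy terms per document (illustrative sketch).

    docs: list of documents, each a list of word tokens.
    u: fraction of the maximum entropy used as the selection threshold.
    """
    M = len(docs)
    vocab = {w for d in docs for w in d}

    # Relative frequency f_ij = tf_ij / N_i (Eq. 2).
    f = []
    for d in docs:
        N_i = len(d)
        counts = {}
        for w in d:
            counts[w] = counts.get(w, 0) + 1
        f.append({w: c / N_i for w, c in counts.items()})

    # p_ij: relative frequencies normalized within each document (Eq. 3).
    p = []
    for fi in f:
        total = sum(fi.values())
        p.append({w: v / total for w, v in fi.items()})

    # Entropy of each term across the M documents (Eq. 4).
    H = {w: -sum(pi[w] * math.log(pi[w]) for pi in p if w in pi)
         for w in vocab}

    H_max = max(H.values())
    # Keep, per document, the terms whose entropy exceeds H_max * u (Eq. 5).
    return [{w for w in set(d) if H[w] > H_max * u} for d in docs]
```

Note that a term occurring in a single document with probability 1 has zero entropy and is therefore always discarded, while terms spread evenly across documents score highest.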


David Pinto 2007-05-08