
Entropy

The focus of our work is the determination of a set of words that characterizes a given set of documents. Given a set of documents $ D=\{D_1, D_2, ..., D_M\}$, where $ N_{i}$ denotes the number of words in document $ D_i$, the relative frequency of the word $ w_j$ in $ D_i$ is defined as follows:

$\displaystyle f_{ij}=\frac{tf_{ij}}{N_{i}},$ (2)

where $ tf_{ij}$ is the number of occurrences of $ w_j$ in $ D_i$,

and

$\displaystyle p_{ij}=\frac{f_{ij}}{\sum_{j=1}^m f_{ij}}$ (3)

is the probability of the word $ w_j$ being in $ D_i$. Thus, the entropy of $ w_j$ can be calculated as:

$\displaystyle H(w_j)=-\sum_{i=1}^M p_{ij}\log p_{ij}.$ (4)

A document $ D_i$ is represented in the VSM by its high-entropy terms. Let $ H_{max}$ be the maximum value of entropy over all the terms, $ H_{max}=\max_j H(w_j)$; the entropy-based representation of $ D_i$ is

$\displaystyle H_i=[w_j\in D_i\vert H(w_j)>H_{max}\cdot u],$ (5)

where $ u$ is a threshold that defines what counts as high entropy. In our experiments we set $ u=0.5$.
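The whole procedure (Eqs. 2-5) can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: documents are assumed to be plain token lists, and the function name and its default threshold are our own choices (the default matches the $ u=0.5$ used in the experiments).

```python
import math

def entropy_term_selection(docs, u=0.5):
    """Select high-entropy terms per document (illustrative sketch).

    docs: list of documents, each a list of word tokens.
    u: fraction of the maximum entropy used as the selection threshold.
    """
    M = len(docs)
    vocab = {w for d in docs for w in d}

    # Relative frequency f_ij = tf_ij / N_i (Eq. 2).
    f = []
    for d in docs:
        N_i = len(d)
        counts = {}
        for w in d:
            counts[w] = counts.get(w, 0) + 1
        f.append({w: c / N_i for w, c in counts.items()})

    # p_ij: relative frequencies normalized within each document (Eq. 3).
    p = []
    for fi in f:
        total = sum(fi.values())
        p.append({w: v / total for w, v in fi.items()})

    # Entropy of each term across the M documents (Eq. 4).
    H = {w: -sum(pi[w] * math.log(pi[w]) for pi in p if w in pi)
         for w in vocab}

    H_max = max(H.values())
    # Keep, per document, the terms whose entropy exceeds H_max * u (Eq. 5).
    return [{w for w in set(d) if H[w] > H_max * u} for d in docs]
```

Note that a term occurring in a single document with probability 1 has zero entropy and is therefore always discarded, while terms spread evenly across documents score highest.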


David Pinto 2007-05-08