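The focus of our work is determining a set of words that characterizes a given set of documents.
Given a set of documents $D = \{D_1, D_2, \ldots, D_m\}$ and the number of words $N_i$
in the document $D_i$, the relative frequency of the word $w_j$
in $D_i$ is defined as follows:
$$ f_{ij} = \frac{tf_{ij}}{N_i} \tag{2} $$
where $tf_{ij}$ is the number of occurrences of $w_j$ in $D_i$, and
$$ p_{ij} = \frac{f_{ij}}{\sum_{k=1}^{m} f_{kj}} \tag{3} $$
is the probability that the word $w_j$ occurs in $D_i$.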
Thus, the entropy of $w_j$ can be calculated as:
$$ H(w_j) = -\sum_{i=1}^{m} p_{ij} \log p_{ij} \tag{4} $$
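The relative frequency, probability and entropy definitions above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `term_entropies`, the tokenised-list input format, the use of the natural logarithm and the convention that a term absent from a document contributes zero to the sum are all assumptions made here.

```python
import math
from collections import Counter


def term_entropies(docs):
    """Return {term: entropy} for a list of tokenised documents.

    `docs` is a list of token lists (hypothetical input format).
    Empty documents are skipped, since N_i would be zero.
    """
    # Relative frequency of each term in each document: tf_ij / N_i.
    rel_freqs = []
    for tokens in docs:
        if not tokens:
            continue
        counts = Counter(tokens)
        n_words = len(tokens)
        rel_freqs.append({w: c / n_words for w, c in counts.items()})

    # For each term, sum its relative frequencies over all documents.
    totals = Counter()
    for freqs in rel_freqs:
        for w, f in freqs.items():
            totals[w] += f

    # Entropy of each term over the documents it occurs in.
    # Documents that lack the term are treated as contributing 0
    # (assumed convention: 0 * log 0 = 0). Natural log is assumed.
    entropy = {w: 0.0 for w in totals}
    for freqs in rel_freqs:
        for w, f in freqs.items():
            p = f / totals[w]
            entropy[w] -= p * math.log(p)
    return entropy


# Toy corpus: "a" is spread evenly over both documents,
# "b" and "c" each occur in a single document.
toy_docs = [["a", "b"], ["a", "c"]]
toy_entropy = term_entropies(toy_docs)
```

Under these assumptions, a term concentrated in a single document gets entropy 0, while a term spread evenly over all $m$ documents gets the maximum value $\log m$.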
The representation of a document $D_i$ is given by the VSM,
restricted to those terms that have high entropy.
Let $H_{max}$ be the maximum entropy value over all the terms;
the entropy-based representation of $D_i$ is
$$ H_i = \{\, w_j \in D_i \mid H(w_j) > u \cdot H_{max} \,\} \tag{5} $$
where $u$ is a threshold which defines the level of high entropy. In our
experiments we have set .
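The selection step can be sketched as follows, assuming the term entropies have already been computed and stored in a dictionary. The function name `select_high_entropy_terms`, the strict inequality, and the toy entropy values are hypothetical, and the threshold used in the example is chosen purely for illustration; it is not taken from the text.

```python
def select_high_entropy_terms(doc_tokens, entropy, u):
    """Return the distinct terms of one document whose entropy
    is strictly greater than u times the maximum entropy.

    `entropy` maps each term to its entropy over the collection
    (hypothetical input); `u` is the threshold, assumed in [0, 1].
    """
    if not entropy:
        return set()
    h_max = max(entropy.values())
    cutoff = u * h_max
    # Terms missing from the entropy table are treated as entropy 0.
    return {w for w in set(doc_tokens) if entropy.get(w, 0.0) > cutoff}


# Made-up entropy values, for illustration only.
example_entropy = {"retrieval": 1.2, "entropy": 0.9, "pinto": 0.0}
example_doc = ["retrieval", "pinto", "retrieval", "entropy"]
selected = select_high_entropy_terms(example_doc, example_entropy, u=0.5)
```

With these made-up numbers the cutoff is $0.5 \times 1.2 = 0.6$, so `retrieval` and `entropy` are kept and `pinto` is dropped. Note that if every term has entropy 0, the strict comparison selects nothing.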
David Pinto
2007-05-08