Term Enrichment



Although TP certainly reduces the dimensionality of the term space while increasing precision, it obtains low recall. For this reason, we propose to enrich the terms selected by this method with terms that have similar characteristics, using a formula based on co-occurring bigrams. Formally, given a document $D_i$ whose reduced representation $R_i$ is made up of only those terms selected by the TP approach, the new important terms for $D_i$ are obtained as follows:

$\displaystyle R_i' = R_i \cup \{w' \mid (w_j \in R_i),\ (v = w'w_j \text{ or } v = w_jw'),\ (v \in D_i),\ (tf_i(v) > 1)\}. \qquad (8)$

That is, we used a window of size one around each term of $R_i$, and required each bigram to occur at least twice in the document before its other word was included as a new term.
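The enrichment step of Equation (8) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `enrich_terms`, the whitespace tokenization, and the reading of a bigram as a pair of adjacent tokens are assumptions made only for this example.

```python
from collections import Counter


def enrich_terms(tokens, selected, min_freq=2):
    """Hypothetical helper: return R_i' given the tokens of D_i and the set R_i.

    tokens   -- the document D_i as a list of word tokens (assumed pre-tokenized)
    selected -- the terms R_i chosen by the TP approach
    min_freq -- minimum bigram frequency; 2 corresponds to tf_i(v) > 1
    """
    selected = set(selected)
    # Count every adjacent-token bigram v = (left, right) in the document.
    bigram_tf = Counter(zip(tokens, tokens[1:]))

    enriched = set(selected)
    for (left, right), freq in bigram_tf.items():
        if freq < min_freq:
            continue
        # Window of size one: take the word immediately before or after
        # a term of R_i. Membership is tested against the original R_i,
        # so newly added terms do not trigger further additions.
        if right in selected:
            enriched.add(left)
        if left in selected:
            enriched.add(right)
    return enriched


tokens = ("the transition point method selects "
          "the transition point terms of a text").split()
print(sorted(enrich_terms(tokens, {"point"})))
# prints ['point', 'transition']
```

Here only the bigram "transition point" occurs twice, so "transition" is the single term added; "method" and "terms" also neighbour "point" but their bigrams occur once and are discarded.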

Since $R_i \subseteq R_i'$ by construction, weighting for the enriched terms follows Equations (1) and (7). The newly added terms $\{w' \mid w' \in R_i' \land w' \notin R_i\}$ use Equation (1) directly.

David Pinto 2007-05-08