
Term Enrichment

Although TP reduces the dimensionality of the term space and increases precision, it obtains low recall. For this reason we propose to enrich the terms selected by this method with terms that have similar characteristics, using a formula based on co-occurrence bigrams. Formally, given a document $ D_i$ made up of only those terms selected by the TP approach ($ R_i$), the new set of important terms for $ D_i$ is obtained as follows:

$\displaystyle R_i' = R_i \cup \{w' \mid (w_j \in R_i),\ (v = w'w_j \text{ or } v = w_jw'),\ (v \in D_i),\ (tf_i(v) > 1)\}.$ (8)

That is, we used a window of size one around each term of $ R_i$, and a new term was included only if the bigram it forms with a term of $ R_i$ occurs at least twice in the document.
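The enrichment step of Equation (8) can be illustrated with a minimal sketch. It assumes the document is already available as a list of tokens and that the TP-selected terms are given as a set; the function name `enrich_terms` and its parameters are hypothetical and are not taken from the original work.

```python
from collections import Counter


def enrich_terms(tokens, selected, min_bigram_freq=2):
    """Illustrative sketch of Equation (8); all names here are assumptions.

    tokens          -- the document D_i as an ordered list of terms
    selected        -- the terms R_i chosen by the TP approach
    min_bigram_freq -- minimum bigram frequency (tf_i(v) > 1 means at least 2)
    """
    selected = set(selected)
    # Frequency of every ordered bigram v of adjacent tokens in the document.
    bigram_tf = Counter(zip(tokens, tokens[1:]))

    enriched = set(selected)
    for (left, right), freq in bigram_tf.items():
        if freq < min_bigram_freq:
            continue
        # Window of size one: add the immediate neighbour of a selected term.
        if left in selected:
            enriched.add(right)
        if right in selected:
            enriched.add(left)
    return enriched


tokens = "transition point method transition point value other point".split()
print(sorted(enrich_terms(tokens, {"point"})))
# → ['point', 'transition']
```

In this toy document only the bigram "transition point" occurs twice, so "transition" is the only term added to the selected set; neighbours such as "method" or "value" appear in bigrams of frequency one and are left out.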

As with $ R_i$, the weighting of the enriched terms follows Equations (1) and (7). The newly added terms $ \{w' \mid w' \in R_i' \land w' \notin R_i\}$ use Equation (1) directly.

David Pinto 2007-05-08