
Description of the FSTs used

  1. Document Frequency (DF): This technique assigns to each term $t$ the value $df_t$, defined as the number of texts in the collection in which $t$ occurs. The technique assumes that low-frequency terms will rarely appear in other documents and, therefore, will carry little information for predicting the class of a text (a sketch of the computation follows the list).

  2. Term Strength (TS): The weight given to each term $t$ is defined by the following equation:

    \begin{displaymath}ts_t = Pr(t \in T_i \vert t \in T_j), \mbox{with } i \ne j,\end{displaymath}

    where $sim(T_i, T_j) \ge \beta$, and $\beta$ is a threshold that must be tuned by inspecting the similarity matrix. A high value of $ts_t$ means that the term $t$ contributes to making the texts $T_i$ and $T_j$ more similar than the threshold $\beta$. A more detailed description can be found in [Yang1995]; a sketch of the estimation follows the list.

  3. Transition Point (TP): A higher weight is given to each term $t$ the closer its frequency is to the transition point frequency, denoted $TP_V$. The following equation shows how this weight is calculated (a sketch follows the list):

    \begin{displaymath}idtp(t,T) = \frac{1}{\vert TP_V - freq(t,T)\vert + 1},\end{displaymath}

    where $freq(t,T)$ is the frequency of the term $t$ in the document $T$.
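
A minimal sketch of the DF computation is the following, assuming the collection is given as a list of already tokenised documents (the toy corpus and the function name are illustrative, not part of the experiments described later):

    from collections import Counter

    def document_frequency(docs):
        # df_t: number of documents in which term t occurs; converting
        # each document to a set counts a term at most once per document
        df = Counter()
        for doc in docs:
            df.update(set(doc))
        return df

    # Illustrative toy corpus of tokenised texts
    corpus = [["feature", "selection", "text"],
              ["text", "clustering"],
              ["feature", "text"]]
    print(document_frequency(corpus))  # Counter({'text': 3, 'feature': 2, ...})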
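The conditional probability that defines $ts_t$ can be estimated by counting, over all ordered pairs of sufficiently similar texts, how often a term occurring in $T_j$ also occurs in $T_i$. The sketch below is one possible reading: it uses Jaccard similarity over term sets as a stand-in for $sim$, since this section does not fix the similarity measure, and both this choice and $\beta$ would have to be tuned as noted above:

    from collections import Counter
    from itertools import permutations

    def term_strength(docs, beta=0.2):
        # Estimates ts_t = Pr(t in T_i | t in T_j) over ordered pairs
        # (i, j), i != j, whose similarity is at least beta
        sets = [set(d) for d in docs]
        def jaccard(a, b):
            return len(a & b) / len(a | b) if a | b else 0.0
        in_j, in_both = Counter(), Counter()
        for i, j in permutations(range(len(sets)), 2):
            if jaccard(sets[i], sets[j]) < beta:
                continue
            for t in sets[j]:
                in_j[t] += 1                 # t occurs in T_j
                if t in sets[i]:
                    in_both[t] += 1          # ... and also in T_i
        return {t: in_both[t] / in_j[t] for t in in_j}

The quadratic loop over document pairs is what makes this technique $O(n^2)$, as discussed below.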
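Finally, a sketch of the $idtp$ weighting, assuming the transition point frequency $TP_V$ has already been computed as described in the previous subsection (here it is simply passed in as a parameter):

    from collections import Counter

    def idtp_weights(doc, tp_v):
        # idtp(t, T) = 1 / (|TP_V - freq(t, T)| + 1); terms whose
        # frequency equals tp_v receive the maximum weight of 1.0
        freq = Counter(doc)
        return {t: 1.0 / (abs(tp_v - f) + 1) for t, f in freq.items()}

    # With tp_v = 2, "text" (frequency 2) gets weight 1.0, while
    # "feature" (1) and "selection" (3) both get weight 0.5
    doc = ["text", "text", "feature", "selection", "selection", "selection"]
    print(idtp_weights(doc, tp_v=2))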


The unsupervised techniques presented here are among the most successful in the clustering area. In particular, DF is a simple and effective technique, and it is known to obtain results comparable to those of classical supervised techniques such as $\chi^2$ (CHI) and Information Gain (IG) [Sebastiani2002]. TP is also simple to compute and, as was seen in Subsection 2.1, it can be used in different areas of NLP. The DF and TP techniques have linear time complexity with respect to the number of terms in the data set. TS, on the other hand, is computationally more expensive than DF and TP, because it requires computing a similarity matrix of the texts, which places it in $O(n^2)$, where $n$ is the number of texts in the data set.

