

The Transition Point Technique

The Transition Point (TP) is a frequency value that splits the vocabulary of a document into two sets of terms: low-frequency and high-frequency. The technique is based on Zipf's Law of Word Occurrences [Zipf1949] and on the refinements by Booth [Booth1967] and Urbizagástegui [Urbizagástegui1999]. These studies aim to show that terms of medium frequency are closely related to the conceptual content of a document. It is therefore reasonable to hypothesize that terms whose frequency is close to the TP can be used as index terms for a document. A typical formula used to obtain this value is given in equation 1:


\begin{displaymath}
TP_V = \frac{\sqrt{8*I_1+1} - 1}{2},
\end{displaymath} (1)

where $I_1$ represents the number of words with frequency equal to $1$ in the text $T$ [Moyotl and Jiménez2004b] [Urbizagástegui1999]. Alternatively, $TP_V$ can be located by identifying the lowest of the high frequencies that is not repeated; this characteristic follows from the properties of Booth's law for low-frequency words [Booth1967].
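Both ways of obtaining $TP_V$ can be illustrated with a minimal Python sketch. It assumes the text is already tokenized into a list of words; the function names `tp_from_formula` and `tp_from_repetition`, the toy vocabulary, and the handling of degenerate cases (returning `None`, or falling back to the lowest frequency) are illustrative assumptions rather than part of the cited work. The second function reads the frequencies from highest to lowest and stops at the first repeated value.

```python
import math
from collections import Counter


def tp_from_formula(tokens):
    """Equation 1: TP_V = (sqrt(8 * I_1 + 1) - 1) / 2."""
    freqs = Counter(tokens)
    # I_1: number of distinct words that occur exactly once in the text.
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def tp_from_repetition(tokens):
    """Lowest of the high frequencies that is not repeated.

    Assumption: returns None when no such frequency can be identified
    (empty text, or the highest frequency is itself repeated).
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    for i in range(len(freqs) - 1):
        if freqs[i] == freqs[i + 1]:
            # freqs[i] is the first repeated value; the one before it is the TP.
            return freqs[i - 1] if i > 0 else None
    # No frequency repeats at all: fall back to the lowest one (assumption).
    return freqs[-1] if freqs else None


# Toy text: frequencies 5, 4, 3, 2, 2 and six words that occur once.
tokens = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"] * 2 + list("fghijk")

print(tp_from_formula(tokens))     # 3.0
print(tp_from_repetition(tokens))  # 3
```

The toy vocabulary was chosen so that both estimates agree; on arbitrary texts the two values need not coincide exactly.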

Let us consider a frequency-sorted vocabulary of a text T; i.e.,

\begin{displaymath}V = [(t_1, f_1), ..., (t_n, f_n)],\end{displaymath}

with $f_{i-1} \geq f_i$ (terms sorted by decreasing frequency). Then $TP_V = f_{i-1}$, where $i$ is the smallest index such that $f_i = f_{i+1}$. The most important words are those whose frequencies are closest to the TP, i.e.,


\begin{displaymath}
V_{TP} = \{ t_i \vert (t_i, f_i) \in V, U_1 \leq f_i \leq U_2 \},
\end{displaymath} (2)

where $U_1$ is the lower threshold, determined by a neighbourhood value $NTP\in[0,1]$ around the TP: $U_1 = (1-NTP)*TP_V$. The upper threshold $U_2$ is calculated analogously: $U_2 = (1+NTP)*TP_V$.
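The selection in equation 2 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `select_terms`, the default `ntp=0.4`, and the toy vocabulary are assumptions, and the transition point is passed in as a precomputed value.

```python
from collections import Counter


def select_terms(tokens, tp, ntp=0.4):
    """Return V_TP: terms whose frequency f satisfies U_1 <= f <= U_2."""
    if not 0.0 <= ntp <= 1.0:
        raise ValueError("ntp must lie in [0, 1]")
    u1 = (1 - ntp) * tp  # lower threshold U_1
    u2 = (1 + ntp) * tp  # upper threshold U_2
    freqs = Counter(tokens)
    return {term for term, f in freqs.items() if u1 <= f <= u2}


# Toy text: frequencies 5, 4, 3, 2, 2 and six words that occur once.
tokens = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"] * 2 + list("fghijk")

print(sorted(select_terms(tokens, tp=3, ntp=0.4)))  # ['b', 'c', 'd', 'e']
print(sorted(select_terms(tokens, tp=3, ntp=0.0)))  # ['c']
```

With $NTP = 0$ only terms whose frequency equals the TP exactly are kept; larger values widen the band symmetrically around it.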

The TP technique has been used in several areas of Natural Language Processing, such as clustering of short texts [Jiménez, Pinto, and Rosso2005a], text categorization [Moyotl and Jiménez2004a] [Moyotl-Hernández and Jiménez-Salazar2005], keyphrase extraction [Pinto and Pérez2004] [Tovar2005], summarization [Bueno, Pinto, and Jiménez-Salazar2005], and weighting models for information retrieval systems [Cabrera, Pinto, and H. Jiménez2005]. We therefore believe that there is enough evidence to support using this technique for term selection.


David Pinto 2006-05-25