
The Transition Point Technique

The Transition Point (TP) is a frequency value that splits the vocabulary of a document into two sets of terms (low and high frequency). This technique is based on Zipf's Law of Word Occurrences [18] and on the refined studies of Booth [2] and Urbizagástegui [17]. These studies aim to show that mid-frequency terms are closely related to the conceptual content of a document. It is therefore reasonable to hypothesize that terms whose frequency is close to TP can be used as index terms of a document. A typical formula used to obtain this value is given in equation 1:


\begin{displaymath}
TP = \frac{\sqrt{8*I_1+1} - 1}{2},
\end{displaymath} (1)

where $I_1$ represents the number of words with frequency equal to $1$ [12] [17].
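
As a rough illustration of equation 1, the following Python sketch counts the words that occur exactly once and applies the formula. The function name `transition_point_formula` and the whitespace tokenization of the toy document are assumptions made for this example; they are not taken from the original system.

```python
import math
from collections import Counter


def transition_point_formula(tokens):
    """Estimate TP with equation 1, where I_1 is the number of words of frequency 1."""
    freqs = Counter(tokens)
    # I_1: number of distinct words that occur exactly once.
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


# Hypothetical toy document (assumption): three words occur once, so I_1 = 3.
tokens = "a a a b b c d e".split()
print(transition_point_formula(tokens))  # prints 2.0
```

Here $I_1 = 3$, so $TP = (\sqrt{25} - 1)/2 = 2$.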

Alternatively, TP can be located by identifying, among the highest frequencies of each document, the lowest frequency that is not repeated; this characteristic follows from the properties of Booth's law of low-frequency words [2]. We have used this approach in our experiments.

Let us consider a frequency-sorted vocabulary of a document; i.e., $V_{TP} = [(t_1, f_1), ..., (t_n, f_n)]$, with $f_i \geq f_{i-1}$, then $TP = f_{i-1}$, iff $f_i=f_{i+1}$. The most important words are those that obtain the closest frequency values to TP, i.e.,


\begin{displaymath}
TP_{SET}=\{ t_i \vert (t_i, f_i) \in V_{TP}, U_1 \leq f_i \leq U_2 \},
\end{displaymath} (2)

where $U_1$ is a lower threshold obtained from a given neighbourhood percentage of TP (NTP), so that $U_1 = (1-NTP)*TP$. The upper threshold $U_2$ is calculated in a similar way ($U_2 = (1+NTP)*TP$).
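
The repetition-based localization of TP and the selection of equation 2 can be sketched in Python as follows. This is only one possible reading of the rule: it assumes that the frequencies are scanned from the highest downwards and that the scan stops at the first frequency shared by two terms. The function names, the toy vocabulary, and the value $NTP = 0.25$ are hypothetical choices for illustration.

```python
def transition_point_by_repetition(freqs):
    """Scan frequencies from highest to lowest and return the last one
    seen before the first repeated frequency (an assumed reading of the rule).
    Returns None if the vocabulary is empty or the top frequency is repeated."""
    ordered = sorted(freqs.values(), reverse=True)
    tp = None
    for i, f in enumerate(ordered):
        # Stop as soon as the current frequency is shared with the next term.
        if i + 1 < len(ordered) and ordered[i + 1] == f:
            break
        tp = f
    return tp


def tp_set(freqs, tp, ntp):
    """Terms whose frequency lies in [U_1, U_2] as in equation 2."""
    u1 = (1 - ntp) * tp  # lower threshold U_1
    u2 = (1 + ntp) * tp  # upper threshold U_2
    return {t for t, f in freqs.items() if u1 <= f <= u2}


# Hypothetical term frequencies of a single document (assumption).
freqs = {"the": 9, "retrieval": 6, "index": 5, "term": 3,
         "model": 3, "zipf": 1, "booth": 1}
tp = transition_point_by_repetition(freqs)
selected = tp_set(freqs, tp, ntp=0.25)
print(tp, sorted(selected))  # prints 5 ['index', 'retrieval']
```

In this toy vocabulary the first repeated frequency is 3, so $TP = 5$; with $NTP = 0.25$ the thresholds are $U_1 = 3.75$ and $U_2 = 6.25$, which keeps only the terms with frequencies 5 and 6.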

We have used the TP technique in different areas of Natural Language Processing (NLP), such as clustering of short texts [7], categorization of texts [9], keyphrase extraction [10] [16], summarization [3], and weighting models for information retrieval systems [4]. Thus, we believe that there is enough evidence to use this technique as a term reduction process.


David Pinto 2006-05-25