The Transition Point Technique

The Transition Point (TP) is a frequency value that splits the vocabulary of a document into two sets of terms (low and high frequency). This technique is based on Zipf's Law of Word Occurrences [18] and on the refined studies of Booth [2] and Urbizagástegui [17]. These studies aim to demonstrate that mid-frequency terms are closely related to the conceptual content of a document. Therefore, it is possible to hypothesize that terms whose frequency is close to TP can be used as index terms for a document. A typical formula used to obtain this value is given in equation 1:

$$TP = \frac{\sqrt{8*I_1+1} - 1}{2}, \qquad (1)$$

where $I_1$ represents the number of words with frequency equal to 1 [12] [17].
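As an illustration, equation 1 can be evaluated directly from a token list. The following is a minimal sketch, assuming the document has already been tokenized and that $I_1$ counts the terms occurring exactly once; the function name `transition_point_formula` is hypothetical and not part of TPIRS.

```python
import math
from collections import Counter


def transition_point_formula(tokens):
    """Evaluate TP = (sqrt(8 * I_1 + 1) - 1) / 2 for a list of tokens.

    Assumption: I_1 is the number of distinct terms with frequency 1.
    """
    term_freqs = Counter(tokens)
    # I_1: number of terms that occur exactly once in the document
    i1 = sum(1 for freq in term_freqs.values() if freq == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


# Toy document: "a" occurs 3 times, "b" twice, and "c", "d", "e" once each,
# so I_1 = 3 and TP = (sqrt(25) - 1) / 2.
tokens = "a a a b b c d e".split()
print(transition_point_formula(tokens))  # 2.0
```

The result is in general not an integer, so in practice it would be compared against term frequencies through a neighbourhood rather than by exact equality.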

Alternatively, TP can be located, for each document, by identifying the lowest frequency (among the highest frequencies) that is not repeated; this characteristic follows from the properties of Booth's law of low-frequency words [2]. We used this approach in our experiments.
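One possible reading of this alternative can be sketched as follows. It assumes that "not repeated" means the frequency value is held by exactly one term, and that the scan starts at the highest frequency and stops at the first frequency shared by several terms. The function name `transition_point_unrepeated` is hypothetical, and the authors' actual implementation may differ.

```python
from collections import Counter


def transition_point_unrepeated(tokens):
    """Return the lowest frequency, scanning from the highest downward,
    that is held by exactly one term. Returns None if even the highest
    frequency is shared by several terms (or the document is empty).
    """
    term_freqs = Counter(tokens)
    # For each frequency value, count how many terms have that frequency.
    terms_per_freq = Counter(term_freqs.values())
    tp = None
    for freq in sorted(terms_per_freq, reverse=True):
        if terms_per_freq[freq] > 1:
            break  # first repeated frequency: the low-frequency zone starts
        tp = freq
    return tp


# Frequencies: a=5, b=4, c=3, d=2, e=2, f=1, g=1.
# 5, 4 and 3 are unique; 2 is the first repeated value, so TP = 3.
tokens = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"] * 2 + ["f", "g"]
print(transition_point_unrepeated(tokens))  # 3
```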

Let us consider a frequency-sorted vocabulary of a document, i.e., $V_{TP} = [(t_1, f_1), ..., (t_n, f_n)]$, with $f_i \geq f_{i-1}$; then $TP = f_{i-1}$ iff $f_i=f_{i+1}$. The most important words are those whose frequency values are closest to TP, i.e.,

$$TP_{SET}=\{ t_i \vert (t_i, f_i) \in V_{TP}, U_1 \leq f_i \leq U_2 \}, \qquad (2)$$

where $U_1$ is a lower threshold obtained by a given neighbourhood percentage of TP (NTP), thus, $U_1 = (1 - NTP) * TP$. $U_2$ is the upper threshold and it is calculated in a similar way ($U_2 = (1 + NTP) * TP$).
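The selection in equation 2 can be sketched as a simple frequency filter. This sketch assumes the thresholds are symmetric around TP, that is, $U_1 = (1 - NTP) * TP$ and $U_2 = (1 + NTP) * TP$, with NTP given as a fraction. The function name `tp_set` and the example values are hypothetical.

```python
from collections import Counter


def tp_set(tokens, tp, ntp):
    """Select the terms whose frequency lies in the neighbourhood of TP.

    Assumption: U_1 = (1 - NTP) * TP and U_2 = (1 + NTP) * TP,
    where ntp is a fraction (e.g. 0.25 for a 25% neighbourhood).
    """
    lower = (1 - ntp) * tp  # U_1
    upper = (1 + ntp) * tp  # U_2
    term_freqs = Counter(tokens)
    return {term for term, freq in term_freqs.items() if lower <= freq <= upper}


# With TP = 10 and NTP = 0.25 the neighbourhood is [7.5, 12.5],
# so only "b" (frequency 12) and "c" (frequency 8) are kept.
tokens = ["a"] * 13 + ["b"] * 12 + ["c"] * 8 + ["d"] * 7 + ["e"]
print(sorted(tp_set(tokens, tp=10, ntp=0.25)))  # ['b', 'c']
```

A wider neighbourhood keeps more index terms per document, while a narrower one gives a stronger reduction of the vocabulary.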

We have used the TP technique in different areas of Natural Language Processing (NLP), such as clustering of short texts [7], text categorization [9], keyphrase extraction [10] [16], summarization [3], and weighting models for information retrieval systems [4]. Thus, we believe that there is enough evidence to support the use of this technique as a term reduction process.

David Pinto 2006-05-25