

The Transition Point Technique

The Transition Point (TP) is a frequency value that splits the vocabulary of a document into two sets of terms: those of low frequency and those of high frequency. The technique is based on Zipf's Law of Word Occurrences [22] and on the refinements by Booth [2] and Urbizagástegui [20]. These studies indicate that terms of medium frequency are closely related to the conceptual content of a document. It is therefore reasonable to hypothesise that terms whose frequency is close to the TP can be used as index terms for a document. A typical formula for obtaining this value is given in equation 1:

$\displaystyle tp_T = \frac{\sqrt{8*I_1+1} - 1}{2},$ (1)

where $ I_1$ represents the number of words with frequency equal to $ 1$ in the text $ T$ [15] [20]. Alternatively, $ tp_T$ can be located by identifying the lowest frequency (among the highest frequencies) that is not repeated; this characteristic follows from the properties of Booth's law for low-frequency words [2].
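As a minimal sketch of the two ways of obtaining $ tp_T$ described above, the Python functions below compute the value from equation 1 and, alternatively, scan the frequencies from the highest downwards until a frequency repeats. The function names, the naive whitespace tokenisation and the toy text are assumptions made for illustration only; they are not part of the original technique.

```python
import math
from collections import Counter


def tp_from_formula(tokens):
    """Transition point from equation 1, using I_1 (words occurring once)."""
    freqs = Counter(tokens)
    # I_1: number of distinct words whose frequency is exactly 1.
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def tp_from_repetition(tokens):
    """Lowest frequency, among the highest ones, that is not repeated.

    Returns None when the two highest frequencies are already tied.
    """
    # One frequency per vocabulary term, sorted from highest to lowest.
    freqs = sorted(Counter(tokens).values(), reverse=True)
    tp = None
    for i, f in enumerate(freqs):
        if i + 1 < len(freqs) and f == freqs[i + 1]:
            break  # first repeated frequency reached: keep the previous value
        tp = f
    return tp


# Toy text (hypothetical); splitting on whitespace is a simplification.
tokens = "a a a a b b b c c d e f g h i".split()
tp_formula = tp_from_formula(tokens)        # I_1 = 6, so (sqrt(49) - 1) / 2
tp_repetition = tp_from_repetition(tokens)  # frequencies: 4, 3, 2, 1, 1, ...
```

On this toy text the formula gives 3.0, while the repetition-based scan gives 2. The two procedures are different estimates of the same boundary and need not coincide, especially on very short texts.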

Let us consider a frequency-sorted vocabulary of a text $ T$; i.e.,

$\displaystyle V = [(t_1, f_1), \ldots, (t_n, f_n)],$

with $ f_i \geq f_{i+1}$. Then $ tp_T = f_{i-1}$ iff $ i$ is the smallest index such that $ f_i = f_{i+1}$. The most important words are those whose frequencies are closest to the TP, i.e.,

$\displaystyle V_{TP} = \{ t_i \vert (t_i, f_i) \in V, U_1 \leq f_i \leq U_2 \},$ (2)

where $ U_1$ is the lower threshold, determined by a neighbourhood value $ NTP\in[0,1]$ around the TP: $ U_1 = (1-NTP)*tp_T$. The upper threshold $ U_2$ is calculated in a similar way: $ U_2 = (1+NTP)*tp_T$.
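The selection in equation 2 can be sketched as a simple filter over the term frequencies. In the Python example below, the function name `select_terms`, the toy text and the neighbourhood value of 0.4 are illustrative assumptions; the source does not prescribe a particular value of $ NTP$.

```python
from collections import Counter


def select_terms(tokens, tp, ntp):
    """Return the terms whose frequency lies in [U_1, U_2] around tp."""
    if not 0.0 <= ntp <= 1.0:
        raise ValueError("ntp must lie in [0, 1]")
    lower = (1 - ntp) * tp  # U_1, the lower threshold
    upper = (1 + ntp) * tp  # U_2, the upper threshold
    freqs = Counter(tokens)
    return {term for term, f in freqs.items() if lower <= f <= upper}


# Toy text (hypothetical). Six words occur once, so equation 1 gives tp = 3.
tokens = "a a a a b b b c c d e f g h i".split()
selected = select_terms(tokens, tp=3.0, ntp=0.4)  # arbitrary neighbourhood
```

With these values the thresholds are roughly 1.8 and 4.2, so the terms with frequencies 2, 3 and 4 are kept and the words that occur only once are discarded. A smaller $ NTP$ narrows the band around the TP and therefore selects fewer terms.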

The TP technique has been used in several areas of Natural Language Processing (NLP), such as clustering of short texts [5], text categorization [12] [13], keyphrase extraction [14] [19], summarization [3], and weighting models for information retrieval systems [4]. We therefore believe that there is enough evidence to support using this technique as a term selection process.


David Pinto 2006-05-25