
The Transition Point Technique

The Transition Point (TP) is a frequency value that splits the vocabulary of a text into two sets of terms: low-frequency and high-frequency. The technique is based on Zipf's Law of Word Occurrences [10] and on the refined studies of Booth [1] and Urbizagástegui [9]. These studies aim to show that the mid-frequency terms of a text $T$ are closely related to the conceptual content of $T$. This supports the hypothesis that terms whose frequency is close to TP can be used as index terms of $T$. A typical formula used to obtain this value is $TP = (\sqrt{8*I_1+1} - 1)/2$, where $I_1$ represents the number of words with frequency equal to $1$; see [5] [9].
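
As an illustration of the formula, the following is a minimal Python sketch. The function names `count_hapaxes` and `transition_point_formula`, the whitespace tokenisation, and the toy text are assumptions made for this example; they are not taken from the original system.

```python
import math
from collections import Counter


def count_hapaxes(tokens):
    """Return I_1, the number of distinct words that occur exactly once."""
    counts = Counter(tokens)
    return sum(1 for freq in counts.values() if freq == 1)


def transition_point_formula(i1):
    """Evaluate TP = (sqrt(8 * I_1 + 1) - 1) / 2."""
    if i1 < 0:
        raise ValueError("I_1 must be non-negative")
    return (math.sqrt(8 * i1 + 1) - 1) / 2


# Hypothetical toy text: "a", "b" and "c" occur once, "d" occurs twice.
tokens = "a b c d d".split()
i1 = count_hapaxes(tokens)          # 3
tp = transition_point_formula(i1)   # (sqrt(25) - 1) / 2 = 2.0
```

The result is generally not an integer, so in practice it would be compared against the observed frequencies rather than matched exactly.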

Alternatively, TP can be located by identifying the lowest frequency, among the highest frequencies, that is not repeated in the text; this characteristic follows from the properties of Booth's law of low-frequency words [1]. In our experiments we have used this approach.
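
A minimal Python sketch of this frequency scan is given here. It rests on one reading of the description: frequencies are sorted in decreasing order and TP is the last frequency before the first repeated one. The function name `transition_point_scan`, the whitespace tokenisation, and the handling of the two edge cases (a tie at the very top, or no repetition at all) are assumptions of this sketch, not part of the original method.

```python
from collections import Counter


def transition_point_scan(tokens):
    """Return the lowest of the leading non-repeated frequencies, or None."""
    counts = Counter(tokens)
    # Frequencies in decreasing order: f_1 >= f_2 >= ... >= f_n
    freqs = sorted(counts.values(), reverse=True)
    for i in range(len(freqs) - 1):
        if freqs[i] == freqs[i + 1]:
            # First repeated frequency found; TP is the one just before it.
            # Assumption: if the two top frequencies already tie, TP is undefined.
            return freqs[i - 1] if i > 0 else None
    # Assumption: if no frequency repeats, fall back to the lowest frequency.
    return freqs[-1] if freqs else None


# Hypothetical toy text with frequencies a:5, b:4, c:3, d:2, e:2, f:1
tokens = "a a a a a b b b b c c c d d e e f".split()
tp = transition_point_scan(tokens)  # frequencies [5, 4, 3, 2, 2, 1] give TP = 3
```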

Let us consider the frequency-sorted vocabulary of a document, i.e., $V_{TP} = [(t_1, f_1), ..., (t_n, f_n)]$ with $f_i \geq f_{i+1}$. Then $TP = f_{i-1}$ iff $f_i=f_{i+1}$, where $i$ is taken as the first position, scanning from the highest frequency, at which a frequency is repeated. The most important words are those nearest to the TP, i.e.,


\begin{displaymath}
TP_{SET}=\{ t_i \mid (t_i, f_i) \in V_{TP},\ U_1 \leq f_i \leq U_2 \}, \qquad (1)
\end{displaymath}

where $U_1$ is a lower threshold obtained from a given neighbourhood percentage of TP (NTP), namely $U_1 = (1-NTP)*TP$. The upper threshold $U_2$ is calculated in a similar way: $U_2 = (1+NTP)*TP$. Both in WebCLEF-2005 and in the current competition we have used $NTP=0.4$, considering that the TP technique is language independent.
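
The selection in Eq. (1) can be sketched in Python as follows. The function name `tp_neighbourhood`, the whitespace tokenisation, and the toy text are assumptions of this example; the TP value is passed in as a parameter, and only the thresholds $U_1$, $U_2$ and the default $NTP=0.4$ come from the text.

```python
from collections import Counter


def tp_neighbourhood(tokens, tp, ntp=0.4):
    """Return the terms whose frequency f satisfies U_1 <= f <= U_2."""
    counts = Counter(tokens)
    u1 = (1 - ntp) * tp  # lower threshold U_1
    u2 = (1 + ntp) * tp  # upper threshold U_2
    return {term for term, freq in counts.items() if u1 <= freq <= u2}


# Hypothetical toy text with frequencies a:5, b:4, c:3, d:2, e:2, f:1
tokens = "a a a a a b b b b c c c d d e e f".split()
# With TP = 3 and NTP = 0.4 the window is roughly [1.8, 4.2],
# so terms with frequency 2, 3 or 4 are kept.
selected = tp_neighbourhood(tokens, tp=3)
```

Because the thresholds are floating-point values, a frequency that falls exactly on $U_1$ or $U_2$ may be included or excluded depending on rounding; an implementation that needs exact boundary behaviour might compare with a small tolerance.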


David Pinto 2007-05-08