The Transition Point Technique

The Transition Point (TP) is a frequency value that splits the vocabulary of a text into two sets of terms: low-frequency and high-frequency. The technique is based on Zipf's Law of Word Occurrences [10] and on the refinements by Booth [1] and Urbizagástegui [9]. These studies aim to show that the mid-frequency terms of a text are closely related to its conceptual content. This motivates the hypothesis that the terms closest to TP can be used as index terms of the text. A typical formula for obtaining this value is $TP = (\sqrt{8 I_1 + 1} - 1)/2$, where $I_1$ represents the number of words with frequency equal to $1$; see [5] [9].
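As a rough illustration of the formula, the following Python sketch counts the terms that occur exactly once and plugs that count into the expression for TP. The function name `transition_point_formula` and the whitespace tokenization in the usage lines are assumptions made for this example; the source does not specify how the text is tokenized.

```python
import math
from collections import Counter


def transition_point_formula(tokens):
    """Estimate TP as (sqrt(8 * I_1 + 1) - 1) / 2.

    `tokens` is a list of already-tokenized terms (the tokenization
    itself is an assumption of this sketch).
    """
    frequencies = Counter(tokens)
    # I_1: number of distinct terms that occur exactly once.
    i1 = sum(1 for f in frequencies.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


# Toy example: "c", "d" and "e" occur once, so I_1 = 3.
tokens = "a a a b b c d e".split()
print(transition_point_formula(tokens))  # 2.0
```

With $I_1 = 3$ the expression gives $(\sqrt{25} - 1)/2 = 2$, so in this toy vocabulary the terms with frequency near 2 would be the candidates.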

Alternatively, TP can be located by identifying the lowest frequency, among the high frequencies, that is not repeated in the text; this characteristic follows from Booth's law of low-frequency words [1]. We used this approach in our experiments.
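A minimal sketch of this second approach, under stated assumptions: frequencies are sorted in descending order, and the scan stops at the first frequency shared by two consecutive terms, returning the frequency just before it. The function name `transition_point_by_repetition` is hypothetical, and the two edge cases (the two highest frequencies already tie, or no frequency repeats at all) are not covered by the text, so their handling here is only one possible choice.

```python
from collections import Counter


def transition_point_by_repetition(tokens):
    """Return the lowest high frequency that is not repeated.

    Edge-case behaviour is an assumption of this sketch:
    - returns None if the two highest frequencies are already equal;
    - returns the lowest frequency if no frequency repeats at all;
    - returns None for an empty token list.
    """
    # Term frequencies, highest first.
    freqs = sorted(Counter(tokens).values(), reverse=True)
    for i in range(len(freqs) - 1):
        if freqs[i] == freqs[i + 1]:
            # First repeated frequency found; TP is the one before it.
            return freqs[i - 1] if i > 0 else None
    return freqs[-1] if freqs else None


# Frequencies: a=5, b=3, c=2, d=2, e=1. The first repeat is 2, so TP = 3.
tokens = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"] * 2 + ["e"]
print(transition_point_by_repetition(tokens))  # 3
```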

Consider the frequency-sorted vocabulary of a document, $V_{TP} = [(t_1, f_1), ..., (t_n, f_n)]$ with $f_i \geq f_{i+1}$; then $TP = f_{i-1}$ iff $f_i = f_{i+1}$, where $i$ is read here as the first index at which a frequency repeats. The most important words are those nearest to the TP, i.e.,