Next: Information Retrieval Model
Up: Description of TPIRS
Previous: Description of TPIRS
The Transition Point (TP) is a frequency value that splits the vocabulary of a
document into two sets of terms: low-frequency and high-frequency. The technique
is based on Zipf's Law of Word Occurrences [18] and on the refined
studies of Booth [2] and Urbizagástegui [17]. These
studies aim to show that mid-frequency terms are
closely related to the conceptual content of a document. It is therefore
reasonable to hypothesize that terms whose frequency is close to TP can be used as index terms for
a document. A typical formula used to obtain
this value is given in equation 1:
$$TP = \frac{\sqrt{8 \cdot I_1 + 1} - 1}{2} \qquad (1)$$
where $I_1$ represents the number of words with frequency equal to 1
[12] [17].
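As an illustration, the formula-based estimate can be sketched in Python. The sketch assumes the usual form of the formula, TP = (sqrt(8 * I1 + 1) - 1) / 2, with I1 the number of words that occur exactly once. The function name `transition_point_formula` and the whitespace tokenization are hypothetical choices made for this example, not part of the original system.

```python
import math
from collections import Counter


def transition_point_formula(tokens):
    """Estimate TP as (sqrt(8 * I1 + 1) - 1) / 2.

    I1 is the number of distinct words that occur exactly once
    (assumed reading of the formula; see the lead-in).
    """
    freqs = Counter(tokens)
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


# Toy document: a occurs 3 times, b twice, and c, d, e once each (I1 = 3).
tokens = "a b c d e a b a".split()
print(transition_point_formula(tokens))  # 2.0
```

Note that this estimate depends only on the number of words occurring once, so two documents with very different high-frequency profiles can receive the same TP.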
Alternatively,
TP can be located, for each document, by scanning down from the highest
frequencies and taking the lowest frequency that
is not repeated; this characteristic follows from Booth's
law of low-frequency words [2]. We used this approach in our experiments.
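A minimal Python sketch of this second approach follows. It reads "not repeated" as "held by exactly one term" and scans the distinct frequency values from the highest downwards, stopping at the first value shared by several terms. The function name `transition_point_unrepeated` and the convention of returning `None` when the highest frequency is already shared are assumptions made for this example; the text does not specify that case.

```python
from collections import Counter


def transition_point_unrepeated(tokens):
    """Return the lowest frequency, scanning down from the highest,
    that is held by exactly one term.

    Returns None if even the highest frequency is shared by several
    terms (an assumed convention, not taken from the source).
    """
    term_freqs = Counter(tokens)
    # Number of distinct terms that share each frequency value.
    terms_per_freq = Counter(term_freqs.values())
    tp = None
    for freq in sorted(terms_per_freq, reverse=True):
        if terms_per_freq[freq] > 1:
            break  # first repeated frequency: stop scanning
        tp = freq
    return tp


# Term frequencies: a=5, b=4, c=3, d=2, e=2, f=1, g=1.
tokens = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"] * 2 + ["f", "g"]
print(transition_point_unrepeated(tokens))  # 3
```

In the toy document, frequencies 5, 4 and 3 each belong to a single term, while 2 is shared by two terms, so the scan stops there and TP is 3.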
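Let us consider a frequency-sorted vocabulary
of a document; i.e.,
$V_{TP} = [(t_1, f_1), \ldots, (t_n, f_n)]$, with $f_i \geq f_{i+1}$;
then $TP = f_{i-1}$ iff $i$ is the first index such that $f_i = f_{i+1}$. The most important words are
those whose frequency values are closest to TP, i.e.,
$$TP_{SET} = \{ t_i \mid (t_i, f_i) \in V_{TP},\ U_1 \leq f_i \leq U_2 \} \qquad (2)$$
where $U_1$ is a lower threshold obtained from a given neighbourhood
percentage of TP (NTP); thus,
$U_1 = (1 - NTP) \cdot TP$. $U_2$ is the
upper threshold and is calculated in a similar way
($U_2 = (1 + NTP) \cdot TP$).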
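The selection of terms around TP can be sketched as follows. The sketch assumes that the lower and upper thresholds are (1 - NTP) * TP and (1 + NTP) * TP, with NTP given as a fraction between 0 and 1. The function name `tp_neighbourhood` and the value NTP = 0.4 used in the example are illustrative choices, not values taken from the experiments.

```python
from collections import Counter


def tp_neighbourhood(tokens, tp, ntp):
    """Return the terms whose frequency f satisfies
    (1 - ntp) * tp <= f <= (1 + ntp) * tp.

    The threshold formulas are an assumed reading of the text.
    """
    lower = (1 - ntp) * tp
    upper = (1 + ntp) * tp
    freqs = Counter(tokens)
    return {term for term, f in freqs.items() if lower <= f <= upper}


# Term frequencies: a=5, b=4, c=3, d=2, e=2, f=1, g=1; TP assumed to be 3.
tokens = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"] * 2 + ["f", "g"]
selected = tp_neighbourhood(tokens, tp=3, ntp=0.4)
print(sorted(selected))  # ['b', 'c', 'd', 'e']
```

With TP = 3 and NTP = 0.4 the thresholds are roughly 1.8 and 4.2, so the most frequent term (a) and the terms occurring once (f, g) are dropped, and only the mid-frequency terms remain as index candidates.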
We have used the TP technique in several areas of Natural Language Processing (NLP),
such as clustering of short texts [7], text categorization [9],
keyphrase extraction [10] [16], summarization [3],
and weighting models for information retrieval systems [4]. Thus,
we believe that there is enough evidence to use this technique
as a term reduction process.
David Pinto
2006-05-25