The Transition Point Technique
The Transition Point (TP) is a frequency value that splits the vocabulary of a
document into two sets of terms (low and high frequency). This technique
is based on Zipf's Law of Word Occurrences [22] and on the refined
studies of Booth [2] and Urbizagástegui [20]. These
studies aim to show that terms of medium frequency are
closely related to the conceptual content of a document. It is therefore
reasonable to hypothesize that terms whose frequency is close to the TP can be used as index terms of
a document. A typical formula used to obtain
this value is given in equation 1:
$$TP_T = \frac{\sqrt{8 \cdot I_1 + 1} - 1}{2} \qquad (1)$$
where $I_1$ represents the number of words with frequency equal to 1 in the text $T$
[15] [20]. Alternatively, $TP_T$ can be located by identifying the lowest frequency (from the highest
frequencies) that is not repeated; this characteristic comes from the properties of Booth's
law for low frequency words [2].
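Both ways of locating the TP can be illustrated with a minimal Python sketch. It assumes the text is already tokenized into a list of terms; the function names `transition_point` and `transition_point_by_scan` are hypothetical and chosen only for this illustration.

```python
from collections import Counter
from math import sqrt


def transition_point(tokens):
    """TP from equation 1: (sqrt(8 * I1 + 1) - 1) / 2."""
    freqs = Counter(tokens)
    # I1: number of distinct terms that occur exactly once in the text.
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (sqrt(8 * i1 + 1) - 1) / 2


def transition_point_by_scan(tokens):
    """Lowest frequency, scanning down from the highest, that is not repeated."""
    freqs = Counter(tokens)
    # Map each frequency value to the number of terms that share it.
    terms_per_freq = Counter(freqs.values())
    tp = None
    for f in sorted(terms_per_freq, reverse=True):
        if terms_per_freq[f] > 1:
            break  # first repeated frequency: stop scanning
        tp = f
    return tp  # None if even the highest frequency is shared


tokens = "a a a a b b b c c d e f".split()
print(transition_point(tokens))          # 2.0
print(transition_point_by_scan(tokens))  # 2
```

In this toy text three terms occur once, so equation 1 gives (sqrt(25) - 1) / 2 = 2, and the scan stops at frequency 1 (shared by three terms) after passing 4, 3 and 2. On real texts the two estimates need not coincide.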
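Let us consider a frequency-sorted vocabulary
of a text $T$; i.e.,
$V_T = [(t_1, f_1), \ldots, (t_n, f_n)]$, with
$f_i \geq f_{i+1}$; then the transition point of $T$ is
$TP_T = f_{i-1}$ iff
$f_i = f_{i+1}$, for the first such index $i$. The most important words are
those that obtain the closest frequency values to TP, i.e.,
$$V_{TP} = \{\, t_i \mid (t_i, f_i) \in V_T,\ U_1 \leq f_i \leq U_2 \,\} \qquad (2)$$
where $U_1$ is a lower threshold obtained by a given neighbourhood
value $NTP$ of the TP, thus
($U_1 = (1 - NTP) \cdot TP_T$). $U_2$ is the
upper threshold and it is calculated in a similar way
($U_2 = (1 + NTP) \cdot TP_T$).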
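The selection of terms around the TP could be sketched as follows. This is an illustration only: it assumes the lower and upper thresholds are (1 - NTP) * TP and (1 + NTP) * TP, where the parameter name `neighbourhood` stands for the neighbourhood value, and the function name `select_terms` and the values used below are hypothetical.

```python
from collections import Counter


def select_terms(tokens, tp, neighbourhood):
    """Return the terms whose frequency lies within a neighbourhood of tp."""
    freqs = Counter(tokens)
    lower = (1 - neighbourhood) * tp  # assumed lower threshold
    upper = (1 + neighbourhood) * tp  # assumed upper threshold
    return {term for term, f in freqs.items() if lower <= f <= upper}


tokens = "a a a a b b b c c d e f".split()
# With TP = 2 and a neighbourhood of 0.4, the band is [1.2, 2.8].
print(select_terms(tokens, tp=2.0, neighbourhood=0.4))  # {'c'}
```

A wider neighbourhood admits more terms: with a value of 0.5 the band becomes [1.0, 3.0], and every term except the most frequent one is kept in this toy example.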
The TP technique has been used in different areas of Natural Language Processing (NLP),
such as clustering of short texts [5], categorization of texts [12] [13],
keyphrase extraction [14] [19], summarization [3],
and weighting models for information retrieval systems [4]. Thus,
we believe that there is enough evidence to justify using this technique
as a term selection process.
David Pinto
2006-05-25