
Discussion

Text representation with the VSM entails the problem of selecting a minimal set of index terms and, thereafter, calculating their weights. Although the VSM and its classical weighting schemes have existed for several decades, they are still essentially the ones used in a diversity of NLP tasks, e.g., text categorization, text clustering, and summarization. It is a well-known empirical fact that using all the terms of a text commonly produces a noisy representation [16]. Moreover, the high dimensionality of the term space has motivated the analysis of index terms. For instance, Salton et al. [15] proposed a measure of discrimination for index terms, i.e., terms defining vectors in the space that better discern which documents answer a particular query. They concluded that, given a collection of $M$ documents, the ``more discriminant'' terms have a frequency in the range $[\frac{M}{100},\frac{M}{10}]$. A similar experiment was carried out in [8], showing that term frequencies around the transition point (TP) overlap the above range. This result suggested analyzing the discriminant value of terms in a neighborhood of the TP [7]. The TP method performs well because it relies on mid-frequency terms; however, many important terms in a document have a frequency far from the TP. In this work, such terms were included in the document representation through a very simple procedure (bigrams), outperforming the TP method.
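For illustration only, the following Python sketch shows how the two frequency-based selection criteria discussed above might look: Salton's mid-frequency filter over document frequencies, and the selection of terms in a neighbourhood of the TP. The TP formulation based on the number of hapax legomena and the tolerance parameter u are assumptions made for the sake of the example, not details taken from this paper.

from collections import Counter

def salton_discriminant_terms(doc_term_lists):
    """Keep terms whose document frequency lies in [M/100, M/10],
    following Salton's mid-frequency heuristic."""
    M = len(doc_term_lists)
    df = Counter()
    for terms in doc_term_lists:
        df.update(set(terms))              # document frequency, not raw counts
    low, high = M / 100, M / 10
    return {t for t, f in df.items() if low <= f <= high}

def transition_point(term_freqs):
    """Transition point computed from the number of hapax legomena I1
    (assumed formulation: TP = (sqrt(8*I1 + 1) - 1) / 2)."""
    i1 = sum(1 for f in term_freqs.values() if f == 1)
    return ((8 * i1 + 1) ** 0.5 - 1) / 2

def tp_neighbourhood_terms(term_freqs, u=0.4):
    """Keep terms whose frequency falls in [(1-u)*TP, (1+u)*TP];
    u is a hypothetical tolerance parameter."""
    tp = transition_point(term_freqs)
    return {t for t, f in term_freqs.items()
            if (1 - u) * tp <= f <= (1 + u) * tp}

Both functions operate on simple term-frequency statistics, which is what makes these criteria cheap to apply compared with the entropy-based selection discussed next.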

The entropy property of reaching its maximum value under equiprobable outcomes indicates that such terms are used across texts with a relatively constant frequency. This indicator is supported by intertextual frequency over a text collection; therefore, the method cannot be applied to isolated texts or to heterogeneous text collections. We have seen that the H method performs very well, but computing the entropy of every term of the collection has a very high computational cost.
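As a rough sketch (not the exact implementation used in this work), the entropy of a term can be computed from the distribution of its occurrences over the documents of the collection; the nested loop over all terms and all documents also makes the computational-cost remark above concrete.

import math
from collections import Counter

def term_entropies(doc_term_lists):
    """Entropy of each term's frequency distribution across the collection;
    a term used with a relatively constant frequency in many texts gets an
    entropy close to the maximum, log2(number of documents)."""
    per_doc = [Counter(terms) for terms in doc_term_lists]
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)
    entropies = {}
    for term, total in totals.items():     # one pass over every document per term
        h = 0.0
        for counts in per_doc:
            p = counts[term] / total       # share of this term's occurrences in the document
            if p > 0:
                h -= p * math.log2(p)
        entropies[term] = h
    return entropies

Selecting the H-terms then amounts to keeping the terms whose entropy exceeds some threshold, which would have to be tuned for the collection at hand.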

The conjecture formulated in [5] establishes that the balanced use of terms throughout the text collection is a characteristic related to Zipf's Law [19]: the minimum effort needed to write a text entails a moderate use of some words, which is revealed by entropy. When dealing with many texts, this may be interpreted as preserving the regularity of occurrence of such words, as if they were relevant because of their role as pivots in the texts. In fact, the experiments carried out in this work showed that the TP enrichment performed in a similar manner to the entropy method. Moreover, in the experiment that combines entropy and TP, most of the terms selected by entropy were also selected by TP (87.21%). Furthermore, only 0.78% of the H-terms do not belong to the set provided by TP'. This fact is confirmed by comparing the TP' precision-recall curve with the H curve (Fig. 1). However, a large number of TP'-terms (6,711) belong to neither the TP-term set nor the H-term set, which introduces an unstable behaviour: good terms and noisy terms spread relevant and non-relevant texts throughout the whole retrieved result.

So far, we have tested the proposed methods on only one collection; further investigations should consider other datasets in order to verify whether the conclusions drawn here carry over to them as well.

A clear advantage of the methods presented in this paper is their unsupervised nature and language independence, which makes them suitable for use in a wide variety of NLP tasks.

