Test over a subset of hep-ex


To get a first look at the behaviour of each term selection method used in our experiments, we ran a first experiment over a subset of hep-ex composed of 500 abstracts taken at random from the original collection; among the categories with only one instance, we randomly took two categories. The threshold used as the minimum accepted similarity in the $k$-NN clustering method was tuned over this collection: the average of the similarities was used as the threshold.
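As a rough illustration of how such a threshold could be obtained, the sketch below averages the pairwise similarities of a toy collection. The cosine measure over raw term frequencies, the whitespace tokenisation, the exclusion of self-similarities, and the function names are assumptions made for this example; the similarity actually used in the experiments is not restated here.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_similarity_threshold(docs):
    """Average similarity over all unordered pairs of distinct documents."""
    vectors = [Counter(d.lower().split()) for d in docs]
    sims = [cosine(a, b) for a, b in combinations(vectors, 2)]
    return sum(sims) / len(sims)


# Toy abstracts, invented for the example.
docs = [
    "muon decay measurement",
    "muon decay search",
    "quark mass measurement",
]
threshold = mean_similarity_threshold(docs)
```

Two texts would then be considered similar enough for clustering only when their similarity reaches `threshold`.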

Figure 1 shows the values of the measure defined in eq. 6 for every term selection method, executed over different percentages of the collection's vocabulary (from 600 to 2,000 terms).

Given a percentage of the collection vocabulary, the DF and TS methods selected the highest-scoring terms. The TP method selected terms locally, i.e. it took a given number of terms from each text. Therefore, the methods must be compared through the vocabularies obtained in each selection of terms. The DF and TS methods used from 2% to 70% of the vocabulary terms, which corresponds to between 21 and 1,700 of the total terms in the collection. The TP selection method took from 5 to 30 terms from each text, giving a similar range of total terms. Fig. 1 shows the results of these three methods; the horizontal axis represents the number of terms and the vertical axis the values of the measure defined in eq. 6. To apply the TS method, the similarity matrix was calculated as 3-tuples ($T_i, T_j, sim_{ij}$) and sorted according to $sim_{ij}$; the TS score was then computed for all terms. Since only 1,349 terms were obtained, the threshold $\beta$ was fixed to 0.
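The difference between global and local selection, and why the methods have to be compared at matching vocabulary sizes, can be sketched as follows. This is only an illustration: `df_select` and `local_select` are hypothetical names, and the local ranking uses the median term frequency of each text as a stand-in reference point, not the transition point itself.

```python
from collections import Counter
from statistics import median


def df_select(texts, fraction):
    """Global selection: keep the top `fraction` of the vocabulary by document frequency."""
    df = Counter(term for text in texts for term in set(text))
    k = max(1, round(fraction * len(df)))
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return set(ranked[:k])


def local_select(texts, n, reference=median):
    """Local selection: keep n terms from each text, then take the union.

    Terms are ranked by how close their frequency is to a per-text
    reference frequency (the median here, a stand-in assumption).
    """
    vocabulary = set()
    for text in texts:
        tf = Counter(text)
        ref = reference(list(tf.values()))
        ranked = sorted(tf, key=lambda t: (abs(tf[t] - ref), t))
        vocabulary.update(ranked[:n])
    return vocabulary


# Toy tokenised texts, invented for the example.
texts = [
    ["muon", "decay", "muon", "rate"],
    ["quark", "mass", "quark", "quark", "decay"],
    ["muon", "mass", "limit"],
]
global_vocab = df_select(texts, 0.5)   # size fixed in advance: 3 of 6 terms
local_vocab = local_select(texts, 1)   # size known only after taking the union
```

With global selection the vocabulary size is fixed by the chosen percentage, whereas with local selection it depends on how much the per-text choices overlap, so the resulting vocabulary sizes are the common ground for comparison.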

**Figure 1:** Behaviour of DF, TS and TP methods in a subset of *hep-ex*.

The DF method was very stable, but it did not help the clustering task. From the beginning, DF included the most frequent terms in the texts, which contributed to maintaining a minimum level of similarity during the clustering task. The baseline, i.e. the clustering done without term selection, indicates that DF selects terms whose text representations keep a resemblance to the original ones. On the other hand, the TS method reached its maximum value after 700 terms, and after 900 terms it became as stable as the DF method.

The TP method outperformed the other two methods. Its maximum value was 0.6415, reached with a vocabulary size of 1,661 terms, which corresponds to only 22 terms per text. The instability of the TP method derives from noise words that are difficult to detect because of their low frequencies. The next subsection presents an analysis of the TP selection in order to control this instability.


David Pinto 2006-05-25