Test over the whole hep-ex collection

An experiment was performed using the entire collection and applying the three methods described in Section 3. In this case, the noise words had a notable effect, mainly on the TP method. Since the TP method selects one term at a time for each text, a wrong selection may be crucial for the clustering task. In some cases, this iterative process includes words that dramatically change the composition of the texts. Thus, a term with a very low DF value changes the threshold used in the clustering task. We tried to address this problem by enriching the terms selected by TP. Dictionaries of related terms such as WordNet cannot be used for this task, since the terminology of the texts is very specialized (see [6]). The problem was solved using $n$-grams as an approximation to related words.
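The enrichment step can be illustrated with a minimal sketch. The text does not specify how the $n$-grams were built or compared, so the following assumes character trigrams and a Dice overlap score. The function names (`char_ngrams`, `dice`, `enrich`), the threshold `min_sim` and the toy vocabulary are all hypothetical and are not taken from the original experiment.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word.

    Assumption: the word is padded with '_' so that prefixes and
    suffixes produce their own n-grams.
    """
    padded = f"_{word}_"
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a, b):
    """Dice coefficient between two sets of n-grams."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def enrich(selected, vocabulary, n=3, min_sim=0.5):
    """Map each selected term to vocabulary words with similar n-grams.

    Assumption: words whose Dice score reaches min_sim are treated as
    an approximation to related words; min_sim is an illustrative value.
    """
    grams = {w: char_ngrams(w, n) for w in vocabulary}
    enriched = {}
    for term in selected:
        term_grams = char_ngrams(term, n)
        enriched[term] = sorted(
            w for w, g in grams.items()
            if w != term and dice(term_grams, g) >= min_sim
        )
    return enriched


# Toy vocabulary, for illustration only.
vocabulary = ["quark", "quarks", "antiquark", "lepton", "leptons", "detector"]
related = enrich(["quark"], vocabulary)
```

Under these assumptions, a term chosen by TP is expanded with morphological variants that share most of its trigrams, which is one way to approximate related words without a domain dictionary.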

Figure 2: Behaviour of DF, TS and TPMI term selection methods.


David Pinto 2006-05-25