

Term Selection Methods

Various term selection methods have been applied to the clustering task; however, as mentioned in Section 1, clustering abstracts from a narrow domain poses the well-known problem that the number of categories to be used in the clustering process is not known in advance. This led us to use unsupervised methods instead of supervised ones, since they also allow the identification of new categories, which is very common in the domain of digital libraries. In this section we describe the unsupervised term selection methods used in our experiments.

  1. Document Frequency (DF): This method assigns the value $df_t$ to each term $t$, where $df_t$ is the number of texts in the collection in which $t$ occurs. The method assumes that low-frequency terms will rarely appear in other documents and, therefore, will have little significance for predicting the class of a text (see the sketch following this list).

  2. Term Strength (TS): The weight given to each term $ t$ is defined by the following equation:

    $ts_t = \Pr(t \in T_i \mid t \in T_j)$, with $i \ne j$,

    where $sim(T_i, T_j) \ge \beta$, and $\beta$ is a threshold that must be tuned by inspecting the similarity matrix. A high value of $ts_t$ means that the term $t$ contributes to making the texts $T_i$ and $T_j$ more similar than the threshold $\beta$. A more detailed description may be found in [21]; a code sketch of TS is given at the end of this section.

  3. Transition Point (TP): A higher weight is assigned to each term $t$ the closer its frequency is to the transition point frequency of the document, denoted $tp_T$. The following equation shows how this value is calculated:

    $idtp(t,T) = \frac{1}{\vert tp_T - freq(t,T) \vert + 1},$

    where $freq(t,T)$ is the frequency of the term $t$ in the document $T$.
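
The following minimal Python sketch illustrates the DF and TP weights. Whitespace tokenization and the closed form for $tp_T$ (computed from the number of terms occurring exactly once, as in the preceding section) are simplifying assumptions of this illustration, and the function names are ours, not part of any standard library.

    from collections import Counter
    from math import sqrt

    def document_frequency(collection):
        # DF: for each term t, the number of texts in the
        # collection in which t occurs at least once.
        df = Counter()
        for text in collection:
            df.update(set(text.split()))
        return df

    def transition_point(freqs):
        # Assumed closed form for tp_T:
        # tp_T = (sqrt(8 * I1 + 1) - 1) / 2, where I1 is the
        # number of terms occurring exactly once in the document.
        i1 = sum(1 for f in freqs.values() if f == 1)
        return (sqrt(8 * i1 + 1) - 1) / 2

    def idtp_weights(text):
        # idtp(t, T) = 1 / (|tp_T - freq(t, T)| + 1): the weight
        # approaches 1 as freq(t, T) approaches tp_T.
        freqs = Counter(text.split())
        tp = transition_point(freqs)
        return {t: 1.0 / (abs(tp - f) + 1) for t, f in freqs.items()}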


The unsupervised methods presented here are among the most successful in the clustering area. In particular, DF is an effective and simple method, and it is known to obtain results comparable to those of classical supervised methods such as $\chi^2$ (CHI) and Information Gain (IG) [17]. TP is also simple to calculate and, as seen in Section 2, it can be used in different areas of NLP. The DF and TP methods have a time complexity that is linear in the number of terms of the data set. On the other hand, TS is computationally more expensive than DF and TP, because it requires calculating a similarity matrix of the texts, which places this method in $O(n^2)$, where $n$ is the number of texts in the data set.
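
To make the cost of TS concrete, the sketch below estimates $ts_t$ as the fraction of related ordered pairs $(T_i, T_j)$, i.e. those with $sim(T_i, T_j) \ge \beta$, in which a term occurring in $T_j$ also occurs in $T_i$. Cosine similarity and this frequency estimate of the conditional probability are assumptions of the illustration; the loop over all text pairs is the $O(n^2)$ step mentioned above.

    from collections import Counter
    from itertools import combinations

    def cosine(a, b):
        # a, b: term -> frequency dictionaries for two texts.
        num = sum(f * b.get(t, 0) for t, f in a.items())
        den = (sum(f * f for f in a.values()) ** 0.5
               * sum(f * f for f in b.values()) ** 0.5)
        return num / den if den else 0.0

    def term_strength(texts, beta):
        # texts: list of term-frequency dictionaries.
        occurs = Counter()  # times t occurs in T_j of a related pair
        shared = Counter()  # times t also occurs in the paired T_i
        for a, b in combinations(texts, 2):  # the O(n^2) step
            if cosine(a, b) < beta:
                continue
            for t_i, t_j in ((a, b), (b, a)):  # both ordered pairs
                for t in t_j:
                    occurs[t] += 1
                    if t in t_i:
                        shared[t] += 1
        return {t: shared[t] / occurs[t] for t in occurs}

In this form, raising $\beta$ reduces the number of related pairs, which is consistent with tuning the threshold by inspecting the similarity matrix, as noted above.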

