
Description of the FSTs used

The first two unsupervised techniques presented in this sub-section have shown their value in clustering [14] and categorization [25]. In particular, the document frequency technique is simple and effective, and it is known to obtain results comparable to those of classical supervised techniques such as $ \chi^2$ and Information Gain [26]. The transition point technique has a simple calculation procedure and has been used in other areas of computational linguistics besides the clustering of short texts: text categorization, keyphrase extraction, summarization, and weighting models for information retrieval systems (see [19]). We therefore consider that there is enough evidence to use this technique as a term selection process.

  1. Document Frequency (DF): This technique assigns the value $ df_t$ to each term $ t$, where $ df_t$ is the number of texts in the collection in which $ t$ occurs. It assumes that low-frequency terms will rarely appear in other documents and therefore carry little significance for predicting the class of a text.

  2. Term Strength (TS): The weight given to each term $ t$ is defined by the following equation:

    $\displaystyle ts_t = Pr(t \in T_i \vert t \in T_j),$   with $\displaystyle i \ne j,$

    In addition, the two texts $ T_i$ and $ T_j$ must be at least as similar as a given threshold, i.e., $ sim(T_i, T_j) \ge \beta$, where $ \beta$ must be tuned according to the values in the similarity matrix. A high value of $ ts_t$ means that the term $ t$ contributes to making the texts $ T_i$ and $ T_j$ more similar than $ \beta$. A more detailed description can be found in [25] and [18].

  3. Transition Point (TP): Each term $ t$ receives a higher weight the closer its frequency is to a frequency called the transition point ($ TP_V$). The transition point is found by automatically inspecting the vocabulary frequencies of each text and identifying the lowest of the high frequencies that is not repeated; this characteristic comes from the formulation of Booth's law for low-frequency words [6] (see [19] for a complete explanation of this procedure). The final value is calculated with the following equation:

    $\displaystyle idtp(t,T) = \frac{1}{\vert TP_V - freq(t,T)\vert+1}$

    where $ freq(t,T)$ is the frequency of the term $ t$ in the document $ T$.
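
The DF and TP weightings in the list above can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the function names are hypothetical, texts are assumed to be already tokenized, and the transition point is located under one reading of the procedure (scan the distinct frequencies from the highest downwards and keep the last one held by a single term). The fallback used when even the highest frequency is repeated is likewise an assumption.

```python
from collections import Counter


def document_frequency(texts):
    """Return df_t: the number of texts in which each term occurs."""
    df = Counter()
    for tokens in texts:
        df.update(set(tokens))  # count each term at most once per text
    return df


def transition_point(term_freqs):
    """Locate TP_V from a mapping term -> frequency (assumed non-empty).

    Assumed reading: walk the distinct frequency values from highest to
    lowest and stop at the first value shared by several terms; TP_V is
    the last value seen before that point.
    """
    counts = Counter(term_freqs.values())  # frequency value -> number of terms
    tp = None
    for f in sorted(counts, reverse=True):
        if counts[f] > 1:
            break
        tp = f
    # Assumed fallback: if the highest frequency is itself repeated, use it.
    return tp if tp is not None else max(counts)


def idtp_weights(tokens):
    """Compute idtp(t, T) = 1 / (|TP_V - freq(t, T)| + 1) for one text T."""
    term_freqs = Counter(tokens)
    if not term_freqs:
        return {}
    tp = transition_point(term_freqs)
    return {t: 1.0 / (abs(tp - f) + 1) for t, f in term_freqs.items()}
```

Both functions make a single pass over the terms (plus a sort over the distinct frequency values for TP), and terms whose frequency equals $ TP_V$ receive the maximum weight of 1.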

The DF and TP techniques have linear time complexity with respect to the number of terms in the data set. TS, on the other hand, is computationally more expensive than DF and TP because it requires computing a similarity matrix of the texts, which makes this technique $ O(n^2)$, where $ n$ is the number of texts in the data set.
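
The quadratic cost of TS is visible in a small sketch that estimates $ ts_t$ by counting over all ordered pairs of texts. This is only an illustration under stated assumptions: the source does not fix the similarity measure here, so Jaccard similarity over term sets stands in for $ sim$, and the names `jaccard` and `term_strength` are hypothetical.

```python
from collections import Counter
from itertools import permutations


def jaccard(a, b):
    """Jaccard similarity of two term sets (an assumed stand-in for sim)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def term_strength(texts, beta, sim=jaccard):
    """Estimate ts_t = Pr(t in T_i | t in T_j) over pairs with sim >= beta."""
    sets = [set(tokens) for tokens in texts]
    in_j = Counter()     # similar pairs (i, j) in which t occurs in T_j
    in_both = Counter()  # ... and t also occurs in T_i
    # All ordered pairs with i != j: this loop is the O(n^2) part.
    for i, j in permutations(range(len(sets)), 2):
        if sim(sets[i], sets[j]) < beta:
            continue
        for t in sets[j]:
            in_j[t] += 1
            if t in sets[i]:
                in_both[t] += 1
    # Terms that never occur in a similar pair get no estimate.
    return {t: in_both[t] / in_j[t] for t in in_j}
```

For example, with the texts `[["a", "b"], ["a", "b", "c"], ["x", "y"]]` and `beta = 0.5`, only the first two texts form a similar pair, so `a` and `b` obtain strength 1.0, `c` obtains 0.0, and `x` and `y` receive no value.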

David Pinto 2007-05-08